diff --git a/config/prismjs/dvc-commands.js b/config/prismjs/dvc-commands.js index 845f8dcc68..ece849ebf6 100644 --- a/config/prismjs/dvc-commands.js +++ b/config/prismjs/dvc-commands.js @@ -7,6 +7,8 @@ module.exports = [ 'unfreeze', 'tag', 'status', + 'stage add', + 'stage', 'run', 'root', 'repro', diff --git a/content/authors/jeny_defigueiredo.md b/content/authors/jeny_defigueiredo.md new file mode 100644 index 0000000000..dcf00954ec --- /dev/null +++ b/content/authors/jeny_defigueiredo.md @@ -0,0 +1,8 @@ +--- +name: Jeny De Figueiredo +avatar: jeny_defigueiredo.png +links: + - https://twitter.com/jendefig +--- + +Community Manager at [DVC](https://dvc.org) diff --git a/content/blog/2021-02-16-february-21-dvc-heartbeat.md b/content/blog/2021-02-16-february-21-dvc-heartbeat.md new file mode 100644 index 0000000000..5c54bb19fc --- /dev/null +++ b/content/blog/2021-02-16-february-21-dvc-heartbeat.md @@ -0,0 +1,177 @@ +--- +title: February '21 Heartbeat +date: 2021-02-16 +description: | + Monthly updates are here! Read all about our growing team, + our CEO's interview on The New Stack, integration with spaCy and more! +descriptionLong: | + Monthly updates are here! Read all about our growing team, + our CEO's interview on The New Stack, integration with spaCy and more! +picture: 2021-02-16/feb21cover.png +author: jeny_defigueiredo +commentsUrl: https://discuss.dvc.org/t/february-21-heartbeat/669 +tags: + - Heartbeat + - CML + - DVC + - DAGsHub + - spaCy + - ML Summit 2021 + - Spell + - MLOps +--- + +## News + +Happy February! Here's all the news to keep you up to date. + +## We've hired and are still hiring! + +We have four new team members this month! + +[**Dave Berenbaum**](https://www.linkedin.com/in/david-berenbaum-20b6b424/) came +to Iterative.ai by way of a +[previous contribution](https://github.com/iterative/dvc/pull/2107) to our open +source products while working as a Data Science Manager at Capital One. He joins +the team as a Technical Product Manager. 
We are thrilled he's here! + +[**Batuhan Taskaya**](https://www.linkedin.com/in/batuhan-osman-taskaya-7803b61a0/) +joins us as a DVC Software Engineer working on the Python core. Batuhan is +excited to work on open source full time and we are excited to have him do so! + +[**Jeny De Figueiredo**](https://www.linkedin.com/in/jenifer-de-figueiredo/) is +involved in the Seattle area data science community at Data Circles and is a +WiDS Puget Sound Ambassador. She joins us as our new Community Manager and is +looking forward to further building and engaging the community in MLOps! (Hi! +This is me. 🙋🏻‍♀️ I'll be writing Heartbeat!) + +[**Roger Parent**](https://www.linkedin.com/in/rogermparent/) has already been a +big part of building DVC and [CML](https://cml.dev/). He has been a primary +developer of a UI that interfaces with the DVC Python application to provide an +interface with the Experiments feature that's coming out with DVC 2.0. We are so +excited to have him joining us full time as a Software Engineer. + +![Search](https://media.giphy.com/media/vAvWgk3NCFXTa/giphy.gif) + +## Open Positions + +We are on the hunt for a +[TypeScript Front-End Engineer](https://docs.google.com/document/d/1aT5HZYt4kAUxXqD4JNTe3jPDlVUwSmnEWDPR2QoKdvo/edit) +to build SaaS and a VS Code UI for our popular machine learning tools: DVC and +CML. The ML tools ecosystem is what the JS space was 10 years ago. Come join us on +this exciting project! + +Our search continues for a +[Developer Advocate](https://weworkremotely.com/remote-jobs/iterative-developer-advocate) +to support and inspire developers by creating new content like blogs, tutorials, +and videos - plus lead outreach through meetups and conferences. + +Does this sound like you or someone you know? Be in touch! 
+ +## Iterative.ai Featured on The New Stack + +[Susan Hall](https://thenewstack.io/author/susanhall/) of +[The New Stack.io](https://thenewstack.io/) interviewed our very own CEO, +[Dmitry Petrov](https://twitter.com/fullstackml), discussing the needs of ML +engineers and how Iterative.ai makes tools to enable version control and CI/CD +for versioning data and ML models. + +> "ML engineers, they still need collaboration. They need GitHub for +> collaboration, they need this CI/CD system to resolve [issues] between each +> other, between the team and productions system." - Dmitry Petrov + + + +## Workshops and Talks + +### Developer Advocacy for Data Science + +So you saw the post further up. 👆🏽 Curious about developer advocacy or what to +look for in a hire for this position? +[Elle O'Brien](https://twitter.com/drelleobrien) dove into this recently with +[Alexey Grigorev](https://twitter.com/Al_Grigor) (author of a +[Data Science Bookcamp](https://mlbookcamp.com/)) +[in this podcast](https://www.youtube.com/watch?v=jv5W4jXk4P4) on +[DataTalks.club](http://datatalks.club/). You can watch it here below. 👇🏼 + +https://www.youtube.com/watch?v=jv5W4jXk4P4 + +## From the Community + +As ever, we have much to share from the great citizens of the DVC community. + +### spaCy and DVC Integration + +If your NLP team uses spaCy to manage your projects, with spaCy's release of +v3.0, you can now enjoy DVC integration to manage your workflow like Git! Check +out the [documentation here](https://spacy.io/usage/projects#integrations) to +streamline and track your process! 🏆 + + + +### DAGsHub and DVC Integrations + +This month two great articles came out regarding the integration of DAGsHub and +DVC. First, this article: [Datasets Should Behave Like Git Repo walks you +through the steps to use DVC in your data versioning. 
The following image shows +the dependencies and how you simply need to do a `dvc update` each time your +dataset or model changes to track the process. + + + +### Did you say "Works Out of the Box?" + +Also from DAGsHub, by CEO [Dean Pleban](https://twitter.com/DeanPlbn), +[Free Dataset & Model Hosting with Zero Configuration - Launching DAGsHub Storage](https://dagshub.com/blog/dagshub-storage-zero-configuration-dataset-model-hosting/) +tells how their new DAGsHub storage is a DVC remote that requires zero +configuration (!) and will allow for team and organization access controls as +well as easy visibility. + +![Friends](https://media.giphy.com/media/Ftz07proVX6Rq/giphy.gif) + +### Model Management and ML Workflow Orchestration with DVC and Apache Airflow 🇩🇪 ❗️ + +We're really excited about a German language workshop led by +[Matthias Niehoff](https://twitter.com/matthiasniehoff)! The workshop will be a +part of the ML Summit 2021 taking place April 19-21st, but registration closes +February 18th. So time is ticking. ⏰ The Conference is online, but will be in +German. For more info, head here 👉🏽 for the +[Workshop Details](https://ml-summit.de/machine-learing/modellmanagement-und-ml-workflow-orchestrierung-mit-dvc-und-apache-airflow/). + +### "_The_ most popular 'N+1' tool used by teams on Spell" + +[Using DVC as a Lightweight Feature Store on Spell](https://spell.ml/blog/using-dvc-with-spell-YBHOChEAACgAaSmV) +by [Aleksey Bilogur](https://twitter.com/ResidentMario) reviews the process of +using DVC with Spell for managing changing datasets, enabling team-wide data +reproducibility and why Spell fans are DVC fans, and vice versa. 🔄 + +![Fans](https://media.giphy.com/media/GM8PrUsm92hRC/giphy.gif) + +## Tweet Love ❤️ + +https://twitter.com/mihail_eric/status/1357014486377324547?s=20 + +You're all caught up! See you at the next Community Gems 💎! + +--- + +_Do you have any use case questions or need support? 
Join us in +[Discord](https://discord.com/invite/dvwXA2N)!_ + +_Head to the [DVC Forum](https://discuss.dvc.org/) to discuss your ideas and +best practices._ diff --git a/content/blog/2021-02-18-dvc-2-0-pre-release.md b/content/blog/2021-02-18-dvc-2-0-pre-release.md new file mode 100644 index 0000000000..b5650256c2 --- /dev/null +++ b/content/blog/2021-02-18-dvc-2-0-pre-release.md @@ -0,0 +1,564 @@ +--- +title: DVC 2.0 Pre-Release +date: 2021-02-17 +description: | + Today, we're announcing DVC 2.0 pre-release. We'll share lessons from our + journey and how these will be reflected in the coming release. + +descriptionLong: | + The new release is a result of our learning from our users. There are four + major features coming: + + 🔗 ML pipeline templating and iterative foreach stages + + 🧪 Lightweight ML experiments + + 📍 ML model checkpoints + + 📈 Dvc-live - new open-source library for metrics logging + +picture: 2021-02-18/dvc-2-0-pre-release.png +pictureComment: DVC 2.0 Pre-Release +author: dmitry_petrov +commentsUrl: https://discuss.dvc.org/t/dvc-2-0-pre-release/681 +tags: + - Release + - MLOps + - DataOps +--- + +## Install + +First things first. You can install the 2.0 pre-release from the master branch +in our repo (instructions [here](https://dvc.org/doc/install/pre-release)) or +through pip: + +```dvc +$ pip install --upgrade --pre dvc +``` + +## ML pipelines parameterization and foreach stages + +After introducing the multi-stage pipeline file `dvc.yaml`, it was quickly +adopted among our users. The DVC team got tons of positive feedback from them, +as well as feature requests. + +### Pipeline parameters from `vars` + +The most requested feature was the ability to use parameters in `dvc.yaml`. For +example, you can pass the same seed value or filename to multiple stages in +the pipeline. + +```yaml +vars: + train_matrix: train.pkl + test_matrix: test.pkl + seed: 20210215 + +... 
+ +stages: + process: + cmd: python process.py \ + --seed ${seed} \ + --train ${train_matrix} \ + --test ${test_matrix} + outs: + - ${test_matrix} + - ${train_matrix} + + ... + + train: + cmd: python train.py ${train_matrix} --seed ${seed} + deps: + - ${train_matrix} +``` + +Also, it gives an ability to localize all important parameters in a single +`vars` block, and play with them. This is a natural thing to do for scenarios +like NLP or when hyperparameter optimization is happening not only in the model +training code but in the data processing as well. + +### Pipeline parameters from params files + +It is quite common to define pipeline parameters in a config file or a +parameters file (like `params.yaml`) instead of in the pipeline file `dvc.yaml` +itself. These parameters defined in `params.yaml` can also be used in +`dvc.yaml`. + +```yaml +# params.yaml +models: + us: + thresh: 10 + filename: 'model-us.hdf5' +``` + +```yaml +# dvc.yaml +stages: + build-us: + cmd: >- + python script.py + --out ${models.us.filename} + --thresh ${models.us.thresh} + outs: + - ${models.us.filename} +``` + +DVC properly tracks params dependencies for each stage starting from the +previous DVC version 1.0. See the +[`--params` option](/doc/command-reference/run#for-displaying-and-comparing-data-science-experiments) +of `dvc run` for more details. + +### Iterating over params with foreach stages + +Iterating over params was a frequently requested feature. Now users can define +multiple similar stages with a templatized command. + +```yaml +stages: + build: + foreach: + gb: + thresh: 15 + filename: 'model-gb.hdf5' + us: + thresh: 10 + filename: 'model-us.hdf5' + do: + cmd: >- + python script.py --out ${item.filename} --thresh ${item.thresh} + outs: + - ${item.filename} +``` + +## Lightweight ML experiments + +DVC uses Git versioning as the basis for ML experiments. This solid foundation +makes each experiment reproducible and accessible from the project's history. 
+This Git-based approach works very well for ML projects with mature models when +only a few new experiments per day are run. + +However, in more active development when dozens or hundreds of experiments need +to be run in a single day, Git creates overhead: each experiment run requires +additional Git commands `git add/commit`, and comparing all experiments is +difficult. + +We introduce lightweight experiments in DVC 2.0! This is how you can auto-track +ML experiments without any overhead for ML engineers. + +⚠️ Note, our new ML experiment features (`dvc exp`) are experimental in the +coming release. This means that the commands might change a bit in following +minor releases. + +`dvc exp run` can run an ML experiment with a new hyperparameter from +`params.yaml` while `dvc exp diff` shows metrics and params difference: + +```dvc +$ dvc exp run --set-param featurize.max_features=3000 + +Reproduced experiment(s): exp-bb55c +Experiment results have been applied to your workspace. + +$ dvc exp diff +Path Metric Value Change +scores.json auc 0.57462 0.0072197 + +Path Param Value Change +params.yaml featurize.max_features 3000 1500 +``` + +More experiments: + +```dvc +$ dvc exp run --set-param featurize.max_features=4000 +Reproduced experiment(s): exp-9bf22 +Experiment results have been applied to your workspace. + +$ dvc exp run --set-param featurize.max_features=5000 +Reproduced experiment(s): exp-63ee0 +Experiment results have been applied to your workspace. + +$ dvc exp run --set-param featurize.max_features=5000 \ + --set-param featurize.ngrams=3 +Reproduced experiment(s): exp-80655 +Experiment results have been applied to your workspace. +``` + +In the examples above, hyperparameters were changed with the `--set-param` +option, but you can make these changes by modifying the params file instead. In +fact _any code or data files can be changed_ and `dvc exp run` will capture the +variations. 
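To make the `--set-param` examples above more concrete, here is a minimal sketch of what a dotted override such as `featurize.max_features=3000` means for a nested params file. This is only an illustration under stated assumptions, not how DVC implements the option: `set_param` is a hypothetical helper, and the params are held in a plain dictionary instead of being read from `params.yaml`.

```python
import copy


def set_param(params, dotted_key, value):
    """Return a copy of `params` with the nested key `dotted_key` set to `value`.

    Hypothetical helper for illustration only; it is not part of DVC.
    """
    updated = copy.deepcopy(params)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Walk down the nesting, creating intermediate sections if needed.
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated


# Stand-in for the contents of params.yaml on `master`.
params = {"featurize": {"max_features": 1500, "ngrams": 2}}

new_params = set_param(params, "featurize.max_features", 3000)
print(new_params)
```

The original dictionary is left untouched, which mirrors the idea that each experiment is a variation on top of the committed baseline.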
+ +See all the runs: + +```dvc +$ dvc exp show --no-pager --no-timestamp \ + --include-params featurize.max_features,featurize.ngrams +┏━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓ +┃ Experiment    ┃ auc     ┃ featurize.max_features ┃ featurize.ngrams ┃ +┡━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩ +│ workspace     │ 0.56359 │ 5000                   │ 3                │ +│ master        │ 0.5674  │ 1500                   │ 2                │ +│ ├── exp-80655 │ 0.56359 │ 5000                   │ 3                │ +│ ├── exp-63ee0 │ 0.5515  │ 5000                   │ 2                │ +│ ├── exp-9bf22 │ 0.56448 │ 4000                   │ 2                │ +│ └── exp-bb55c │ 0.57462 │ 3000                   │ 2                │ +└───────────────┴─────────┴────────────────────────┴──────────────────┘ +``` + +Under the hood DVC uses Git to store the experiments meta-information. A +straightforward implementation would create visible branches and auto-commit in +them, but that approach would over-pollute the branch namespace very quickly. To +avoid this issue, we introduced custom Git references `exps`, the same way as +GitHub uses custom references `pulls` to track pull requests (this is an +interesting technical topic that deserves a separate blog post). Below you can +see how it works. 
+ +No artificial branches, only custom references `exps` (do not worry if you don't +understand this part - it is an implementation detail): + +```dvc +$ git branch +* master + +$ git show-ref +5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_APPLY +5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_BRANCH +5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/71/67904d89e116f28daf7a6e4c0878268117c893/exp-80655 +f16e7b7c804cf52d91d1d11850c15963fb2a8d7b refs/exps/97/d69af70c6fb4bc59aefb9a87437dcd28b3bde4/exp-63ee0 +0566d42cddb3a8c4eb533f31027f0febccbbc2dd refs/exps/91/94265d5acd847e1c439dd859aa74b1fc3d73ad/exp-bb55c +9bb067559583990a8c5d499d7435c35a7c9417b7 refs/exps/49/5c835cd36772123e82e812d96eabcce320f7ec/exp-9bf22 +``` + +The best experiment can be promoted to the workspace and committed to Git. + +```dvc +$ dvc exp apply exp-bb55c +$ git add . +$ git commit -m 'optimize max feature size' +``` + +Alternatively, an experiment can be promoted to a branch (`big_fr_size` branch +in this case): + +```dvc +$ dvc exp branch exp-80655 big_fr_size +Git branch 'big_fr_size' has been created from experiment 'exp-c695f'. +To switch to the new branch run: + + git checkout big_fr_size +``` + +Remove all the experiments that were not used: + +```dvc +$ dvc exp gc --workspace --force +``` + +## Model checkpoints + +ML model checkpoints are an essential part of deep learning. ML engineers prefer +to save the model files (or weights) at checkpoints during a training process +and return to them when metrics start diverging or learning is not fast enough. + +The checkpoints create a different dynamic around the ML modeling process and need +special support from the toolset: + +1. Track and save model checkpoints (DVC outputs) periodically, not only the + final result or training epoch. +2. Save metrics corresponding to each of the checkpoints. +3. Reuse checkpoints - warm-start training with an existing model file, + corresponding code, dataset version and metrics. 
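The three requirements in the list above can be sketched with a toy training loop. This is a hedged illustration that uses only the Python standard library: the `train` function, the JSON "model" file and the fake loss update are all assumptions made for the example, and nothing here uses DVC or a real ML framework.

```python
import json
import os
import tempfile


def train(checkpoint_path, epochs):
    """Toy training loop that warm-starts from `checkpoint_path` if it exists."""
    state = {"step": -1, "loss": 2.5}
    if os.path.exists(checkpoint_path):
        # Requirement 3: reuse an existing checkpoint instead of starting over.
        with open(checkpoint_path) as f:
            state = json.load(f)
    for _ in range(epochs):
        state["step"] += 1
        # Stand-in for a real optimization step.
        state["loss"] = round(state["loss"] * 0.9, 4)
        # Requirements 1 and 2: save the "model" and its metrics at every step.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "model.json")

first = train(ckpt, epochs=5)   # fresh run: steps 0..4
second = train(ckpt, epochs=5)  # resumed run: steps 5..9
print(first["step"], second["step"])
```

The second call continues from the saved step rather than from zero, which is the warm-start behaviour the checkpoint feature is meant to make reproducible.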
+ +This new behaviour is supported in DVC 2.0. Now, DVC can version all your +checkpoints with corresponding code and data. It brings reproducibility of DL +processes to the next level - every checkpoint is reproducible. + +This is how you define checkpoints with live-metrics: + +```dvc +$ dvc stage add -n train \ + -d users.csv -d train.py \ + -p dropout,epochs,lr,process \ + --checkpoint model.h5 \ + --live logs \ + python train.py + +Creating 'dvc.yaml' +Adding stage 'train' in 'dvc.yaml' +``` + +Note, we use the `dvc stage add` command instead of `dvc run`. Starting from DVC 2.0 +we are extracting all stage-specific functionality under the `dvc stage` umbrella. +`dvc run` still works, but it will be deprecated in a following DVC version +(most likely in 3.0). + +Start the training process and interrupt it after 5 epochs: + +```dvc +$ dvc exp run +'users.csv.dvc' didn't change, skipping +Running stage 'train': +> python train.py +... +^CTraceback (most recent call last): +... +KeyboardInterrupt +``` + +Navigate in checkpoints: + +```dvc +$ dvc exp show --no-pager --no-timestamp +┏━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━┳━━━━━━━━┳━━━┓ +┃ Experiment    ┃ step ┃ loss   ┃ accuracy ┃ val_loss ┃ … ┃ epochs ┃ … ┃ +┡━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━╇━━━━━━━━╇━━━┩ +│ workspace     │ 4    │ 2.0702 │ 0.30388  │ 2.025    │ … │ 5      │ … │ +│ master        │ -    │ -      │ -        │ -        │ … │ 5      │ … │ +│ │ ╓ exp-e15bc │ 4    │ 2.0702 │ 0.30388  │ 2.025    │ … │ 5      │ … │ +│ │ ╟ 5ea8327   │ 4    │ 2.0702 │ 0.30388  │ 2.025    │ … │ 5      │ … │ +│ │ ╟ bc0cf02   │ 3    │ 2.1338 │ 0.23988  │ 2.0883   │ 
… │ 5      │ … │ +│ │ ╟ f8cf03f   │ 2    │ 2.1989 │ 0.17932  │ 2.1542   │ … │ 5      │ … │ +│ │ ╟ 7575a44   │ 1    │ 2.2694 │ 0.12833  │ 2.223    │ … │ 5      │ … │ +│ ├─╨ a72c526   │ 0    │ 2.3416 │ 0.0959   │ 2.2955   │ … │ 5      │ … │ +└───────────────┴──────┴────────┴──────────┴──────────┴───┴────────┴───┘ +``` + +Each of the checkpoints above is a separate experiment with all data, code, +parameters and metrics. You can use the same `dvc exp apply` command to extract +any of these. + +Another run just continues this process. You can see how the accuracy metric is +increasing - DVC does not remove the model/checkpoint and training code trains +on top of it: + +```dvc +$ dvc exp run +Existing checkpoint experiment 'exp-e15bc' will be resumed +... +^C +KeyboardInterrupt + +$ dvc exp show --no-pager --no-timestamp +┏━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━┳━━━━━━━━┳━━━┓ +┃ Experiment    ┃ step ┃ loss   ┃ accuracy ┃ val_loss ┃ … ┃ epochs ┃ … ┃ +┡━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━╇━━━━━━━━╇━━━┩ +│ workspace     │ 9    │ 1.7845 │ 0.58125  │ 1.7381   │ … │ 5      │ … │ +│ master        │ -    │ -      │ -        │ -        │ … │ 5      │ … │ +│ │ ╓ exp-e15bc │ 9    │ 1.7845 │ 0.58125  │ 1.7381   │ … │ 5      │ … │ +│ │ ╟ 205a8d3   │ 9    │ 1.7845 │ 0.58125  │ 1.7381   │ … │ 5      │ … │ +│ │ ╟ dd23d96   │ 8    │ 1.8369 │ 0.54173  │ 1.7919   │ … │ 5      │ … │ +│ │ ╟ 5bb3a1f   │ 7    │ 1.8929 │ 0.49108  │ 
1.8474   │ … │ 5      │ … │ +│ │ ╟ 6dc5610   │ 6    │ 1.951  │ 0.43433  │ 1.9046   │ … │ 5      │ … │ +│ │ ╟ a79cf29   │ 5    │ 2.0088 │ 0.36837  │ 1.9637   │ … │ 5      │ … │ +│ │ ╟ 5ea8327   │ 4    │ 2.0702 │ 0.30388  │ 2.025    │ … │ 5      │ … │ +│ │ ╟ bc0cf02   │ 3    │ 2.1338 │ 0.23988  │ 2.0883   │ … │ 5      │ … │ +│ │ ╟ f8cf03f   │ 2    │ 2.1989 │ 0.17932  │ 2.1542   │ … │ 5      │ … │ +│ │ ╟ 7575a44   │ 1    │ 2.2694 │ 0.12833  │ 2.223    │ … │ 5      │ … │ +│ ├─╨ a72c526   │ 0    │ 2.3416 │ 0.0959   │ 2.2955   │ … │ 5      │ … │ +└───────────────┴──────┴────────┴──────────┴──────────┴───┴────────┴───┘ +``` + +After modifying code, data or params the same process can be resumed. DVC +recognizes the change and shows it (see experiment `b363267`): + +```dvc +$ vi train.py # modify code +$ vi params.yaml # modify params + +$ dvc exp run +Modified checkpoint experiment based on 'exp-e15bc' will be created +... 
+ +$ dvc exp show --no-pager --no-timestamp +┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━┳━━━━━━━━┳━━━┓ +┃ Experiment            ┃ step ┃ loss   ┃ accuracy ┃ val_loss ┃ … ┃ epochs ┃ … ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━╇━━━━━━━━╇━━━┩ +│ workspace             │ 13   │ 1.5841 │ 0.69262  │ 1.5381   │ … │ 15     │ … │ +│ master                │ -    │ -      │ -        │ -        │ … │ 5      │ … │ +│ │ ╓ exp-7ff06         │ 13   │ 1.5841 │ 0.69262  │ 1.5381   │ … │ 15     │ … │ +│ │ ╟ 6c62fec           │ 12   │ 1.6325 │ 0.67248  │ 1.5857   │ … │ 15     │ … │ +│ │ ╟ 4baca3c           │ 11   │ 1.6817 │ 0.64855  │ 1.6349   │ … │ 15     │ … │ +│ │ ╟ b363267 (2b06de7) │ 10   │ 1.7323 │ 0.61925  │ 1.6857   │ … │ 15     │ … │ +│ │ ╓ 2b06de7           │ 9    │ 1.7845 │ 0.58125  │ 1.7381   │ … │ 5      │ … │ +│ │ ╟ 205a8d3           │ 9    │ 1.7845 │ 0.58125  │ 1.7381   │ … │ 5      │ … │ +│ │ ╟ dd23d96           │ 8    │ 1.8369 │ 0.54173  │ 1.7919   │ … │ 5      │ … │ +│ │ ╟ 5bb3a1f           │ 7    │ 1.8929 │ 0.49108  │ 1.8474   │ … │ 5      │ … │ +│ │ ╟ 6dc5610           │ 6    │ 1.951  │ 0.43433  │ 1.9046   │ … │ 5      │ … │ +│ │ ╟ a79cf29           │ 5    │ 2.0088 │ 0.36837  │ 1.9637   │ … │ 5      │ … │ +│ │ ╟ 5ea8327           │ 4    │ 2.0702 │ 0.30388  │ 2.025    │ … │ 5      │ … │ +│ │ ╟ bc0cf02           │ 3    │ 2.1338 │ 0.23988  │ 2.0883   │ … │ 5      │ … │ +│ │ ╟ f8cf03f           │ 2    │ 2.1989 │ 0.17932  │ 2.1542   │ … │ 5      │ … │ +│ │ ╟ 7575a44           │ 1    │ 2.2694 │ 0.12833  │ 2.223    │ … │ 5      │ … │ 
+│ ├─╨ a72c526           │ 0    │ 2.3416 │ 0.0959   │ 2.2955   │ … │ 5      │ … │ +└───────────────────────┴──────┴────────┴──────────┴──────────┴───┴────────┴───┘ +``` + +Sometimes you might need to train the model from scratch. The reset option +removes the checkpoint file before training: `dvc exp run --reset`. + +## Metrics logging + +Continuously logging ML metrics is a very common practice in the ML world. +Instead of a simple command line output with the metrics values many ML +engineers prefer visuals and plots. These plots can be organized in a "database" +of ML experiments to keep track of a project. There are many special solutions +for metrics collecting and experiment tracking such as Sacred, MLflow, Weights & +Biases, neptune.ai and others. + +With DVC 2.0 we are releasing a new open-source library +[DVC-Live](https://github.com/iterative/dvclive) that provides functionality for +tracking model metrics and organizing metrics in simple text files in a way that +DVC can visualize the metrics with navigation in Git history. So, DVC can show +you a metrics difference between the current model and a model in `master` or any +other branch. + +This approach is similar to the other metrics tracking tools with the difference +that Git becomes a "database" of ML experiments. + +### Generate metrics file + +Install the library: + +```dvc +$ pip install dvclive +``` + +Instrument your code: + +```python +import dvclive +from dvclive.keras import DvcLiveCallback + +dvclive.init("logs") #, summarize=True) + +... + +model.fit(... 
+ # Set up DVC-Live callback: + callbacks=[ DvcLiveCallback() ] + ) + +``` + +During the training you will see the metrics files that are continuously +populated at each epoch: + +```dvc +$ ls logs/ +accuracy.tsv loss.tsv val_accuracy.tsv val_loss.tsv + +$ head logs/accuracy.tsv +timestamp step accuracy +1613645582716 0 0.7360000014305115 +1613645585478 1 0.8349999785423279 +1613645587322 2 0.8830000162124634 +1613645589125 3 0.9049999713897705 +1613645590891 4 0.9070000052452087 +1613645592681 5 0.9279999732971191 +1613645594490 6 0.9430000185966492 +1613645596232 7 0.9369999766349792 +1613645598034 8 0.9430000185966492 +``` + +In addition to the continuous metrics files you will see the summary metrics +file and html file with the same file prefix. The summary file contains the +result of the latest epoch: + +```dvc +$ cat logs.json | python -m json.tool +{ + "step": 41, + "loss": 0.015958430245518684, + "accuracy": 0.9950000047683716, + "val_loss": 13.705962181091309, + "val_accuracy": 0.5149999856948853 +} +``` + +The html file contains all the visuals for continuous metrics as well as the +summary metrics in a single page: + +![](/uploads/images/2021-02-18/dvclive-html.png) + +Note, the HTML and the summary metrics files are generated automatically at +each epoch, so you can monitor model performance in realtime. + +### Git-navigation with the metrics file + +A DVC repository is NOT required to use the live metrics functionality +above. It works independently from DVC. + +A DVC repository becomes useful when the metrics and plots are committed in your Git +repository and you need navigation around the metrics. + +Metrics difference between workspace and the last Git commit: + +```dvc +$ git status -s + M logs.json + M logs/accuracy.tsv + M logs/loss.tsv + M logs/val_accuracy.tsv + M logs/val_loss.tsv + M train.py +?? 
model.h5 + +$ dvc metrics diff --target logs.json +Path Metric Old New Change +logs.json accuracy 0.995 0.99 -0.005 +logs.json loss 0.01596 0.03036 0.0144 +logs.json step 41 36 -5 +logs.json val_accuracy 0.515 0.5175 0.0025 +logs.json val_loss 13.70596 3.29033 -10.41563 +``` + +The difference between a particular commit/branch/tag or between two commits: + +```dvc +$ dvc metrics diff --target logs.json HEAD^ 47b85c +Path Metric Old New Change +logs.json accuracy 0.995 0.998 0.003 +logs.json loss 0.01596 0.01951 0.00355 +logs.json step 41 82 41 +logs.json val_accuracy 0.515 0.51 -0.005 +logs.json val_loss 13.70596 5.83056 -7.8754 +``` + +The same Git-navigation works with the plots: + +```dvc +$ dvc plots diff --target logs +file:///Users/dmitry/src/exp-dc/plots.html +``` + +![](/uploads/images/2021-02-18/dvclive-diff-html.png) + +Another nice thing about the live metrics - they work across ML experiments and +checkpoints if properly set up in dvc stages. To set up live metrics you need to +specify the metrics directory in `live` section of a stage: + +```yaml +stages: + train: + cmd: python train.py + live: + logs: + cache: false + summary: true + report: true + deps: + - data +``` + +## Thank you! + +I'd like to thank all of you DVC community members for the feedback that we are +constantly getting. This feedback helps us build new functionalities in DVC and +make it more stable. + +Please be in touch with us on [Twitter](https://twitter.com/DVCorg) and our +[Discord channel](https://dvc.org/chat). diff --git a/content/docs/command-reference/stage/add.md b/content/docs/command-reference/stage/add.md new file mode 100644 index 0000000000..270f64887f --- /dev/null +++ b/content/docs/command-reference/stage/add.md @@ -0,0 +1,427 @@ +# stage add + +Helper command to create or update stages in `dvc.yaml`. 
+ +## Synopsis + +```usage +usage: dvc stage add [-h] [-q | -v] -n [-d ] [-o ] + [-O ] [-p [:]] + [-m ] [-M ] [--plots ] + [--plots-no-cache ] [-w ] [-f] + [--outs-persist ] + [--outs-persist-no-cache ] + [--always-changed] [--external] [--desc ] + command + +positional arguments: + command Command to execute +``` + +## Description + +Creates or updates stages in a [pipeline](/doc/command-reference/dag) (saved to +`dvc.yaml` in the current working directory). + +A stage name is required and can be provided using the `-n` (`--name`) option. +Most of the other [options](#options) help with defining different kinds of +[dependencies and outputs](#dependencies-and-outputs) for the stage. The +remaining terminal input provided to `dvc stage add` after `-`/`--` flags will +become the required [`command` argument](#the-command-argument). + +`dvc repro` can be used to execute pipelines after their stages have been +defined. + +
+ +### 💡 Avoiding unexpected behavior + +We don't want to tell anyone how to write their code or what programs to use! +However, please be aware that in order to prevent unexpected results when DVC +reproduces pipeline stages, the underlying code should ideally follow these +rules: + +- Read/write exclusively from/to the specified dependencies and + outputs (including parameters files, metrics, and plots). + +- Completely rewrite outputs. Do not append or edit. + +- Stop reading and writing files when the `command` exits. + +Also, if your pipeline reproducibility goals include consistent output data, its +code should be +[deterministic](https://en.wikipedia.org/wiki/Deterministic_algorithm) (produce +the same output for any given input): avoid code that increases +[entropy](https://en.wikipedia.org/wiki/Software_entropy) (e.g. random numbers, +time functions, hardware dependencies, etc.). + +
+ +### The `command` argument + +The `command` sent to `dvc stage add` can be anything your terminal would accept +and run directly, for example a shell built-in, expression, or binary found in +`PATH`. Please remember that any flags sent after the `command` are considered +part of the command itself, not of `dvc stage add`. + +⚠️ While DVC is platform-agnostic, the commands defined in your +[pipeline](/doc/command-reference/dag) stages may only work on some operating +systems and require certain software packages to be installed. + +Wrap the command with double quotes `"` if there are special characters in it +like `|` (pipe) or `<`, `>` (redirection), otherwise they would apply to +`dvc stage add` itself. Use single quotes `'` instead if there are environment +variables in it that should be evaluated dynamically. Examples: + +```dvc +$ dvc stage add -n first_stage "./a_script.sh > /dev/null 2>&1" +$ dvc stage add -n second_stage './another_script.sh $MYENVVAR' +``` + +### Dependencies and outputs + +By specifying lists of dependencies (`-d` option) and/or +outputs (`-o` and `-O` options) for each stage, we can create a +_dependency graph_ ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) +that connects them, i.e. the output of a stage becomes the input of another, and +so on (see `dvc dag`). This graph can be restored by DVC later to modify or +[reproduce](/doc/command-reference/repro) the full pipeline. For example: + +```dvc +$ dvc stage add -n printer -d write.sh -o pages ./write.sh +$ dvc stage add -n scanner -d read.sh -d pages -o signed.pdf ./read.sh pages +``` + +Stage dependencies can be any file or directory, either untracked, or more +commonly tracked by DVC or Git. Outputs will be tracked and cached +by DVC when the stage is run. Every output version will be cached when the stage +is reproduced (see also `dvc gc`). 
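As a rough illustration of how the `printer` and `scanner` stages above form a dependency graph, the sketch below links each stage to the stage that produces one of its dependencies and then derives an execution order. This is not DVC's implementation: `build_graph` and the `stages` dictionary are hypothetical names for this example, and the ordering uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory view of the two stages defined above.
stages = {
    "printer": {"deps": ["write.sh"], "outs": ["pages"]},
    "scanner": {"deps": ["read.sh", "pages"], "outs": ["signed.pdf"]},
}


def build_graph(stages):
    """Map each stage to the set of stages whose outputs it depends on."""
    producers = {
        out: name for name, stage in stages.items() for out in stage["outs"]
    }
    return {
        name: {producers[dep] for dep in stage["deps"] if dep in producers}
        for name, stage in stages.items()
    }


graph = build_graph(stages)
# TopologicalSorter expects a mapping of node -> predecessors.
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['printer', 'scanner']
```

Dependencies such as `write.sh` and `read.sh` have no producing stage, so they add no edges; only `pages` connects the two stages, which is why `printer` has to run before `scanner`.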
+ +Relevant notes: + +- Typically, scripts to run (or possibly a directory containing the source code) + are included among the specified `-d` dependencies. This ensures that when the + source code changes, DVC knows that the stage needs to be reproduced. (You can + choose whether to do this.) + +- `dvc stage add` checks the dependency graph integrity before creating a new + stage. For example: two stages cannot specify the same output or overlapping + output paths, there should be no cycles, etc. + +- DVC does not feed dependency files to the command being run. The program will + have to read by itself the files specified with `-d`. + +- Entire directories produced by the stage can be tracked as outputs by DVC, + which generates a single `.dir` entry in the cache (refer to + [Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) + for more info). + +- [External dependencies](/doc/user-guide/external-dependencies) and + [external outputs](/doc/user-guide/managing-external-data) (outside of the + workspace) are also supported (except metrics and plots). + +- Outputs are deleted from the workspace before executing the command (including + at `dvc repro`) if their paths are found as existing files/directories (unless + `--outs-persist` is used). This also means that the stage command needs to + recreate any directory structures defined as outputs every time it's executed + by DVC. + +- In some situations, we have previously executed a stage, and later notice that + some of the files/directories used by the stage as dependencies, or created as + outputs are missing from `dvc.yaml`. It is possible to + [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) + without having to execute it again. 
+ +- Renaming dependencies or outputs requires a + [manual process](/doc/command-reference/move#renaming-stage-outputs) to update + `dvc.yaml` and the project's cache accordingly. + +### For displaying and comparing data science experiments + +[Parameters](/doc/command-reference/params) (`-p`/`--params` option) are a +special type of key/value dependency. Multiple parameter dependencies can be +specified from within one or more YAML, JSON, TOML, or Python parameters files +(e.g. `params.yaml`). This allows tracking experimental hyperparameters easily. + +Special types of output files, [metrics](/doc/command-reference/metrics) (`-m` +and `-M` options) and [plots](/doc/command-reference/plots) (`--plots` and +`--plots-no-cache` options), are also supported. Metrics and plots files have +specific formats (JSON, YAML, CSV, or TSV) and allow displaying and comparing +data science experiments. + +## Options + +- `-n `, `--name ` (**required**) - specify a name for the stage + generated by this command (e.g. `-n train`). Stage names can only contain + letters, numbers, dash `-` and underscore `_`. + +- `-d `, `--deps ` - specify a file or a directory the stage depends + on. Multiple dependencies can be specified like this: + `-d data.csv -d process.py`. Usually, each dependency is a file or a directory + with data, or a code file, or a configuration file. DVC also supports certain + [external dependencies](/doc/user-guide/external-dependencies). + + When you use `dvc repro`, the list of dependencies helps DVC analyze whether + any dependencies have changed, and thus which stages must be executed to + regenerate their outputs. + +- `-o `, `--outs ` - specify a file or directory that is the result + of running the `command`. Multiple outputs can be specified: + `-o model.pkl -o output.log`. DVC builds a dependency graph (pipeline) to + connect different stages with each other based on this list of outputs and + dependencies (see `-d`). 
DVC tracks all output files and directories and puts + them into the cache (this is similar to what happens when you use + `dvc add`). + +- `-O `, `--outs-no-cache ` - the same as `-o` except that outputs + are not tracked by DVC. This means that they are never cached, so it's up to + the user to manage them separately. This is useful if the outputs are small + enough to be tracked by Git directly; or large, yet you prefer to regenerate + them every time (see `dvc repro`); or unwanted in storage for any other + reason. + +- `--outs-persist ` - declare output file or directory that will not be + removed when `dvc repro` starts (but it can still be modified, overwritten, or + even deleted by the stage command(s)). + +- `--outs-persist-no-cache ` - the same as `--outs-persist` except that + outputs are not tracked by DVC (same as with `-O` above). + +- `-p [:]`, `--params [:]` - specify a set + of [parameter dependencies](/doc/command-reference/params) the stage depends + on, from a parameters file. This is done by sending a comma-separated list as + argument, e.g. `-p learning_rate,epochs`. The default parameters file name is + `params.yaml`, but this can be redefined with a prefix in the argument sent to + this option, e.g. `-p parse_params.yaml:threshold`. See `dvc params` to learn + more about parameters. + +- `-m `, `--metrics ` - specify a metrics file produced by this + stage. This option behaves like `-o` but registers the file in a `metrics` + field inside the `dvc.yaml` stage. Metrics are usually small, human-readable + files (JSON or YAML) with scalar numbers or other simple information that + describes a model (or any other data artifact). See `dvc metrics` to learn + more about _metrics_. + +- `-M `, `--metrics-no-cache ` - the same as `-m` except that DVC + does not track the metrics file (same as with `-O` above). This means that + they are never cached, so it's up to the user to manage them separately. 
This + is typically desirable with _metrics_ because they are small enough to be + tracked with Git directly. + +- `--plots ` - specify a plot metrics file produced by this stage. This + option behaves like `-o` but registers the file in a `plots` field inside the + `dvc.yaml` stage. Plot metrics are data series stored in tabular (CSV or TSV) + or hierarchical (JSON or YAML) files, with complex information that describes + a model (or any other data artifact). See `dvc plots` to learn more about + plots. + +- `--plots-no-cache ` - the same as `--plots` except that DVC does not + track the plots file (same as with `-O` and `-M` above). This may be desirable + with _plots_, if they are small enough to be tracked with Git directly. + +- `-w `, `--wdir ` - specifies a working directory for the `command` + to run in (uses the `wdir` field in `dvc.yaml`). Dependency and output files + (including metrics and plots) should be specified relative to this directory. + It's used by `dvc repro` to change the working directory before executing the + `command`. + +- `-f`, `--force` - overwrite an existing stage in `dvc.yaml` file without + asking for confirmation. + +- `--always-changed` - always consider this stage as changed (uses the + `always_changed` field in `dvc.yaml`). As a result `dvc status` will report it + as `always changed` and `dvc repro` will always execute it. + + > Note that regular `.dvc` files (without dependencies) are automatically + > considered "always changed", so this option has no effect in those cases. + +- `--external` - allow writing outputs outside of the DVC repository. See + [Managing External Data](/doc/user-guide/managing-external-data). + +- `--desc ` - user description of the stage (optional). This doesn't + affect any DVC operations. + +- `-h`, `--help` - prints the usage/help message, and exits. + +- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no + problems arise, otherwise 1. 
+ +- `-v`, `--verbose` - displays detailed tracing information. + +## Examples + +Let's create a DVC project and a stage (that counts the number of +lines in a `test.txt` file): + +```dvc +$ mkdir example && cd example +$ git init +$ dvc init +$ mkdir data +$ dvc stage add -n count \ + -d test.txt \ + -o lines \ + "cat test.txt | wc -l > lines" +Creating 'dvc.yaml' +Adding stage 'count' in 'dvc.yaml' + +To track the changes with git, run: + + git add .gitignore dvc.yaml + +$ tree +. +├── dvc.yaml +└── test.txt +``` + +This results in the following stage entry in `dvc.yaml`: + +```yaml +stages: + count: + cmd: 'cat test.txt | wc -l > lines' + deps: + - test.txt + outs: + - lines +``` + +There's no `lines` file in the workspace as the stage has not been run yet. +It'll be created and tracked whenever `dvc repro` is run. + +## Example: Overwrite an existing stage + +The following stage runs a Python script that trains an ML model on the training +dataset (`20180226` is a seed value): + +```dvc +$ dvc stage add -n train \ + -d train_model.py -d matrix-train.p -o model.p \ + python train_model.py 20180226 model.p +``` + +To update a stage that is already defined, the `-f` (`--force`) option is +needed. Let's update the seed for the `train` stage: + +```dvc +$ dvc stage add -n train --force \ + -d train_model.py -d matrix-train.p -o model.p \ + python train_model.py 18494003 model.p +``` + +## Example: Separate stages in a subdirectory + +Let's move to a subdirectory and create a stage there. This generates a separate +`dvc.yaml` file in that location. The stage command itself counts the lines in +`test.txt` and writes the number to `lines`. + +```dvc +$ cd more_stages/ +$ dvc stage add -n process_data \ + -d data.in \ + -o result.out \ + ./my_script.sh data.in result.out +$ tree .. +. +├── dvc.yaml +├── dvc.lock +├── file1 +├── ... 
+└── more_stages/ + ├── data.in + └── dvc.yaml +``` + +## Example: Chaining stages + +DVC [pipelines](/doc/command-reference/dag) are constructed by connecting the +outputs of a stage to the dependencies of the following one(s). + +Let's create a stage that extracts an XML file from an archive to the `data/` +folder: + +```dvc +$ mkdir data +$ dvc stage add -n extract \ + -d Posts.xml.zip \ + -o data/Posts.xml \ + unzip Posts.xml.zip -d data/ +``` + +> Note that the last `-d` applies to the stage's command (`unzip`), not to +> `dvc stage add`. + +Also, let's add another stage that executes an R script that parses the XML +file: + +```dvc +$ dvc stage add -n parse \ + -d parsingxml.R -d data/Posts.xml \ + -o data/Posts.csv \ + Rscript parsingxml.R data/Posts.xml data/Posts.csv +``` + +These stages are not run yet, so there are no outputs. But we can still see how +they are connected into a pipeline (given their outputs and dependencies) with +`dvc dag`: + +```dvc +$ dvc dag ++---------+ +| extract | ++---------+ + * + * + * ++---------+ +| parse | ++---------+ +``` + +We can use `dvc repro` to execute this pipeline to get the outputs. 
+ +## Example: Using parameter dependencies + +To use specific values inside a parameters file as dependencies, create a simple +YAML file named `params.yaml` (default params file name, see `dvc params` to +learn more): + +```yaml +seed: 20180226 + +train: + lr: 0.0041 + epochs: 75 + layers: 9 + +processing: + threshold: 0.98 + bow_size: 15000 +``` + +Define a stage with both regular dependencies as well as parameter dependencies: + +```dvc +$ dvc stage add -n train \ + -d train_model.py -d matrix-train.p -o model.p \ + -p seed,train.lr,train.epochs \ + python train_model.py 20200105 model.p +``` + +`train_model.py` will include some code to open and parse the parameters: + +```py +import yaml + +with open("params.yaml", 'r') as fd: + params = yaml.safe_load(fd) + +seed = params['seed'] +lr = params['train']['lr'] +epochs = params['train']['epochs'] +``` + +DVC will keep an eye on these param values (same as with the regular dependency +files) and know that the stage should be reproduced if/when they change. See +`dvc params` for more details. diff --git a/content/docs/command-reference/stage/index.md b/content/docs/command-reference/stage/index.md new file mode 100644 index 0000000000..d9c4d98fe4 --- /dev/null +++ b/content/docs/command-reference/stage/index.md @@ -0,0 +1,25 @@ +# stage + +A set of commands to add and list stages: +[add](/doc/command-reference/stage/add). + +## Synopsis + +```usage +usage: dvc stage [-h] [-q | -v] {add,list} ... + +positional arguments: + COMMAND + add Create stage. + list List stages. +``` + +## Description + +_Stages_ represent individual data processes, including their input and +resulting outputs. They can be combined to capture simple data workflows, +organize data science projects, or build detailed machine learning pipelines. + +`dvc stage add` can be used to create/update stages in the `dvc.yaml` file. +Similarly, `dvc stage list` helps list the stages present in the `dvc.yaml` +file. 
diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md index 9b8157707e..78b66e45c7 100644 --- a/content/docs/command-reference/update.md +++ b/content/docs/command-reference/update.md @@ -6,8 +6,8 @@ repositories, and the corresponding import stage `.dvc` files. ## Synopsis ```usage -usage: dvc update [-h] [-q | -v] [--rev ] [-R] - targets [targets ...] +usage: dvc update [-h] [-q | -v] [--rev ] [-R] [--to-remote] + [-r ] [-j ] targets [targets ...] positional arguments: targets Import stage .dvc files to update. Using -R, directories @@ -49,6 +49,20 @@ $ dvc update --rev master directory and its subdirectories for import stage `.dvc` files to inspect. If there are no directories among the targets, this option is ignored. +- `--to-remote` - update the import `.dvc` file and + [transfer](/doc/command-reference/import-url#example-import-straight-to-the-remote) + the new data directly to remote storage (the default one unless `-r` is used). + No changes are done in the workspace. Use `dvc pull` to get the + data locally. + +- `-r `, `--remote ` - name of the + [remote storage](/doc/command-reference/remote) (can only be used with + `--to-remote`). + +- `-j `, `--jobs ` - parallelism level for DVC to download data + from the source. The default value is `4 * cpu_count()`. For SSH remotes, the + default is `4`. Using more jobs may speed up the operation. + - `-h`, `--help` - prints the usage/help message, and exit. - `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no diff --git a/content/docs/index.md b/content/docs/index.md index 88d0303e6b..96acf10c81 100644 --- a/content/docs/index.md +++ b/content/docs/index.md @@ -1,8 +1,9 @@ # DVC Documentation -Data Version Control, or DVC, is a data and ML experiment management tool that -takes advantage of the existing engineering toolset that you're already familiar -with (Git, CI/CD, etc.). 
+Data Version Control, or DVC, is a data and ML +[experiment management](/doc/user-guide/experiment-management) tool that takes +advantage of the existing engineering toolset that you're already familiar with +(Git, CI/CD, etc.). diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index e6fa86237d..73a436f487 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -127,6 +127,7 @@ "merge-conflicts" ] }, + "experiment-management", "setup-google-drive-remote", "large-dataset-optimization", "external-dependencies", @@ -360,6 +361,17 @@ "label": "run", "slug": "run" }, + { + "label": "stage", + "slug": "stage", + "source": "stage/index.md", + "children": [ + { + "label": "stage add", + "slug": "add" + } + ] + }, { "label": "status", "slug": "status" diff --git a/content/docs/start/experiments.md b/content/docs/start/experiments.md index 33b270bcc5..0cc46a29ec 100644 --- a/content/docs/start/experiments.md +++ b/content/docs/start/experiments.md @@ -14,6 +14,9 @@ Read on or watch our video to see how it's done! https://youtu.be/iduHPtBncBk +> 📖 See [Experiment Management](/doc/user-guide/experiment-management) for more +> information on DVC's approach. + ## Collecting metrics First, let's see what is the mechanism to capture values for these ML experiment diff --git a/content/docs/user-guide/basic-concepts/experiment.md b/content/docs/user-guide/basic-concepts/experiment.md new file mode 100644 index 0000000000..e4130a9683 --- /dev/null +++ b/content/docs/user-guide/basic-concepts/experiment.md @@ -0,0 +1,11 @@ +--- +name: Experiment +match: [experiment, experiments] +tooltip: >- + An attempt to reach desired/better/interesting results during data pipelining + or ML model development. 
DVC is designed to help [manage + experiments](/doc/user-guide/experiment-management), having built-in + mechanisms like the + [run-cache](/doc/user-guide/project-structure/internal-files#run-cache) and + the `dvc experiments` commands (coming in DVC 2.0). +--- diff --git a/content/docs/user-guide/basic-concepts/run-cache.md b/content/docs/user-guide/basic-concepts/run-cache.md index 53f50dff85..148e1a0378 100644 --- a/content/docs/user-guide/basic-concepts/run-cache.md +++ b/content/docs/user-guide/basic-concepts/run-cache.md @@ -2,10 +2,10 @@ name: 'Run-cache' match: ['run-cache'] tooltip: >- - The DVC run-cache is a log of stages that have been run in the project. It's - comprised of `dvc.lock` file backups, identified as combinations of - dependencies, commands, and outputs that correspond to each other. `dvc repro` - and `dvc run` populate and reutilize the run-cache. See + A log of stages that have been run in the project. It's comprised of + `dvc.lock` file backups, identified as combinations of dependencies, commands, + and outputs that correspond to each other. `dvc repro` and `dvc run` populate + and reutilize the run-cache. See [Run-cache](/doc/user-guide/project-structure/internal-files#run-cache) for more details. --- diff --git a/content/docs/user-guide/basic-concepts/stage.md b/content/docs/user-guide/basic-concepts/stage.md new file mode 100644 index 0000000000..e73027b460 --- /dev/null +++ b/content/docs/user-guide/basic-concepts/stage.md @@ -0,0 +1,8 @@ +--- +name: Stage +match: [stage, stages] +tooltip: >- + A stage represents individual data processes, including their input and + resulting output which can be combined to build detailed machine learning + pipelines. 
+--- diff --git a/content/docs/user-guide/experiment-management.md b/content/docs/user-guide/experiment-management.md new file mode 100644 index 0000000000..7cda085317 --- /dev/null +++ b/content/docs/user-guide/experiment-management.md @@ -0,0 +1,138 @@ +# Experiment Management + +Data science and ML are iterative processes that require a large number of +attempts to reach a certain level of a metric. Experimentation is part of the +development of data features, hyperspace exploration, deep learning +optimization, etc. DVC helps you codify and manage all of your +experiments, supporting these main approaches: + +1. Create [experiments](#experiments) that derive from your latest project + version without having to track them manually. DVC does that automatically, + letting you list and compare them. The best ones can be promoted, and the + rest archived. +2. Place in-code [checkpoints](#checkpoints-in-source-code) that mark a series + of variations, forming an in-depth experiment. DVC helps you capture them at + runtime, and manage them in batches. +3. Apply experiments or checkpoints as [persistent](#persistent-experiments) + commits in your repository. Or create these versions from + scratch like typical project changes. + + At this point you may also want to consider the different + [ways to organize](#organization-patterns) experiments in your project (as + Git branches, as folders, etc.). + +DVC also provides specialized features to codify and analyze experiments. +[Parameters](/doc/command-reference/params) are simple values you can tweak in a +human-readable text file, which cause different behaviors in your code and +models. On the other end, [metrics](/doc/command-reference/metrics) (and +[plots](/doc/command-reference/plots)) let you define, visualize, and compare +meaningful measures for the experimental results. 
+ +## Experiments + +⚠️ This feature is only available in DVC 2.0 ⚠️ + +`dvc exp` commands let you automatically track a variation to an established +[data pipeline](/doc/command-reference/dag). You can create multiple isolated +experiments this way, as well as review, compare, and restore them later, or +roll back to the baseline. The basic workflow goes like this: + +- Modify dependencies (e.g. input data or source code), + hyperparameters, or commands (`cmd` field of `dvc.yaml`) of + committed stages. +- Use `dvc exp run` (instead of `repro`) to execute the pipeline. This puts the + experiment's results in your workspace, and tracks it under the + hood. +- Visualize experiment configurations and results with `dvc exp show`. Repeat. +- Use [metrics](/doc/command-reference/metrics) in your pipeline to identify the + best experiment(s), and promote them to persistent experiments (regular + commits) with `dvc exp apply`. + +
+ +### How does DVC track experiments? + +DVC uses actual commits under custom +[Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References) +(found in `.git/refs/exps`) to keep track of experiments created with `dvc exp`. +Each commit has the repo `HEAD` as parent. These are not pushed to the Git +remote by default (see `dvc exp push`). + +> References have a unique signature similar to the +> [entries in the run-cache](/doc/user-guide/project-structure/internal-files#run-cache). + +
+ +## Checkpoints in source code + +⚠️ This feature is only available in DVC 2.0 ⚠️ + +To track successive steps in a longer experiment, you can write your code so it +registers checkpoints with DVC at runtime. This allows you, for example, to +track the progress in deep learning techniques such as evolving neural networks. + +This kind of experiment is also derived from your latest project version, but it +tracks a series of variations (the checkpoints). You interact with them using +`dvc exp run`, `dvc exp resume`, and `dvc exp reset` (see also the `checkpoint` +field of `dvc.yaml` outputs). + +
+ +### How are checkpoints captured by DVC? + +When DVC runs a checkpoint-enabled pipeline, a custom Git branch (in +`.git/refs/exps`) is started off the repo `HEAD`. A new commit is appended each +time the code calls `dvc.api.make_checkpoint()` or writes a +`.dvc/tmp/DVC_CHECKPOINT` signal file. These are not pushed to the Git remote by +default (see `dvc exp push`). + +
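As a rough illustration of in-code checkpoints, the sketch below shows a toy training loop that signals a checkpoint after every epoch. The `train` function, the `model.txt` file, and the fake "training step" are hypothetical. The only DVC call is `dvc.api.make_checkpoint()`, taken from the text above, and it is assumed to be importable only when DVC 2.0 is installed; a callback parameter lets the loop run without DVC.

```python
def train(epochs, make_checkpoint=None, model_path="model.txt"):
    """Toy training loop that registers a checkpoint after every epoch."""
    if make_checkpoint is None:
        # Assumption: DVC 2.0 is installed and provides this function.
        from dvc.api import make_checkpoint

    weight = 0.0
    for epoch in range(epochs):
        weight += 0.1  # stand-in for a real training step
        # Rewrite the checkpoint output in full before signaling DVC.
        with open(model_path, "w") as fd:
            fd.write(f"{epoch} {weight:.3f}\n")
        make_checkpoint()
    return weight
```

In a real pipeline the file written here would presumably be declared as a stage output with the `checkpoint` field set, and the script would be started with `dvc exp run` so that each call can be captured as a commit.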
+ +## Persistent experiments + +When your experiments are good enough to save or share, you may want to store +them persistently as commits in your repository. + +Whether the results were produced with `dvc repro` directly, or after a +`dvc exp` workflow (refer to previous sections), the `dvc.yaml` and `dvc.lock` +pair in the workspace will codify the experiment as a new project +version. The right outputs (including +[metrics](/doc/command-reference/metrics)) should also be present, or available +via `dvc checkout`. + +> 👨‍💻 See [Get Started: Experiments](/doc/start/experiments) for a hands-on +> introduction to regular experiments. + +### Organization patterns + +DVC takes care of arranging `dvc exp` experiments and the data +cache under the hood. But when it comes to full-blown persistent +experiments, it's up to you to decide how to organize them in your project. +These are the main alternatives: + +- **Git tags and branches** - use the repo's "time dimension" to distribute your + experiments. This makes the most sense for experiments that build on each + other. Helpful if the Git [revisions](https://git-scm.com/docs/revisions) can + be easily visualized, for example with tools + [like GitHub](https://docs.github.com/en/github/visualizing-repository-data-with-graphs/viewing-a-repositorys-network). +- **Directories** - the project's "space dimension" can be structured with + directories (folders) to organize experiments. Useful when you want to see all + your experiments at the same time (without switching versions) by just + exploring the file system. +- **Hybrid** - combining an intuitive directory structure with a good repo + branching strategy tends to be the best option for complex projects. + Completely independent experiments live in separate directories, while their + progress can be found in different branches. 
+ +## Automatic log of stage runs (run-cache) + +Every time you `dvc repro` pipelines or `dvc exp run` experiments, DVC logs the +unique signature of each stage run (to `.dvc/cache/runs` by default). If it +never happened before, the stage command(s) are executed normally. Every +subsequent time a [stage](/doc/command-reference/run) runs under the same +conditions, the previous results can be restored instantly, without wasting time +or computing resources. + +✅ This built-in feature is called run-cache and it can +dramatically improve performance. It's enabled out-of-the-box (but can be +disabled with the `--no-run-cache` command option). diff --git a/content/docs/user-guide/project-structure/internal-files.md b/content/docs/user-guide/project-structure/internal-files.md index f130a2690b..3c107c5e85 100644 --- a/content/docs/user-guide/project-structure/internal-files.md +++ b/content/docs/user-guide/project-structure/internal-files.md @@ -131,9 +131,10 @@ That's how DVC knows that the other two cached files belong in the directory. have been run in the project. It is found in the `runs/` directory inside the cache (or [remote storage](/doc/command-reference/remote)). -Runs are identified as combinations of dependencies, commands, and -outputs that correspond to each other. These combinations are -hashed into special values that make up the file paths inside the run-cache dir. +Runs are identified as combinations of exact dependency contents +(or [parameter](/doc/command-reference/params) values), and the literal +command(s) to execute. These combinations are represented by special hashes that +translate to the file paths inside the run-cache dir: ```dvc $ tree .dvc/cache/runs @@ -151,3 +152,6 @@ run. 💡 `dvc push` and `dvc pull` (and `dvc fetch`) can download and upload the run-cache to remote storage for sharing and/or as a back up. 
+ +> Note that the run-cache assumes that stage commands are deterministic (see +> **Avoiding unexpected behavior** in `dvc run`). diff --git a/content/docs/user-guide/project-structure/pipelines-files.md b/content/docs/user-guide/project-structure/pipelines-files.md index 8fb5315fc6..c335151e36 100644 --- a/content/docs/user-guide/project-structure/pipelines-files.md +++ b/content/docs/user-guide/project-structure/pipelines-files.md @@ -44,6 +44,8 @@ changed to decide whether the stage requires re-execution (see `dvc status`). If it writes files or dirs, they can be defined as outputs (`outs`). DVC will track them going forward (similar to using `dvc add`). +> See the full stage entry [specification](#stage-entries). + ### Parameter dependencies [Parameters](/doc/command-reference/params) are a special type of stage @@ -337,7 +339,9 @@ stages: > Note that this feature is not compatible with [templating](#templating) at the > moment. -## Specification +## Stage entries + +These are the fields that are accepted in each stage: | Field | Description | | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -369,11 +373,12 @@ validation and auto-completion. > Notice that these are a subset of those in `.dvc` file > [output entries](/doc/user-guide/project-structure/dvc-files#output-entries). -| Field | Description | -| --------- | ------------------------------------------------------------------------------------------------------------------------------------------ | -| `cache` | Whether or not this file or directory is cached (`true` by default). See the `--no-commit` option of `dvc add`. 
| -| `persist` | Whether the output file/dir should remain in place while `dvc repro` runs (`false` by default: outputs are deleted when `dvc repro` starts | -| `desc` | (Optional) user description for this output. This doesn't affect any DVC operations. | +| Field | Description | +| ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `cache` | Whether or not this file or directory is cached (`true` by default). See the `--no-commit` option of `dvc add`. | +| `persist` | Whether the output file/dir should remain in place while `dvc repro` runs (`false` by default: outputs are deleted when `dvc repro` starts | +| `desc` | (Optional) user description for this output. This doesn't affect any DVC operations. | +| `checkpoint` | Set to `true` to let DVC know that this output is associated with [in-code checkpoints](/doc/user-guide/experiment-management#checkpoints-in-source-code) (for `dvc experiments`). | ## dvc.lock file diff --git a/content/docs/user-guide/related-technologies.md b/content/docs/user-guide/related-technologies.md index b845cb7c67..9a913b029e 100644 --- a/content/docs/user-guide/related-technologies.md +++ b/content/docs/user-guide/related-technologies.md @@ -82,6 +82,9 @@ _Luigi_, etc. ## Experiment management software +> See also the [Experiment Management](/doc/user-guide/experiment-management) +> guide. + - DVC uses Git as the underlying version control layer for data, pipelines, and experiments. Data versions exist as metadata in Git, as opposed to using external databases or APIs, so no additional services are required. diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md index b89df5ec9a..5b3ef69dfb 100644 --- a/content/docs/user-guide/what-is-dvc.md +++ b/content/docs/user-guide/what-is-dvc.md @@ -1,10 +1,11 @@ # What Is DVC? 
**Data Version Control** is a new type of data versioning, workflow, and -experiment management software, that builds upon [Git](https://git-scm.com/) -(although it can work stand-alone). DVC reduces the gap between established -engineering tool sets and data science needs, allowing users to take advantage -of new [features](#core-features) while reusing existing skills and intuition. +[experiment management](/doc/user-guide/experiment-management) software, that +builds upon [Git](https://git-scm.com/) (although it can work stand-alone). DVC +reduces the gap between established engineering tool sets and data science +needs, allowing users to take advantage of new [features](#core-features) while +reusing existing skills and intuition. ![](/img/reproducibility.png) _DVC codifies data and ML experiments_ diff --git a/static/uploads/avatars/jeny_defigueiredo.png b/static/uploads/avatars/jeny_defigueiredo.png new file mode 100644 index 0000000000..4aec186b65 Binary files /dev/null and b/static/uploads/avatars/jeny_defigueiredo.png differ diff --git a/static/uploads/images/2021-02-16/dagshub-logo.png b/static/uploads/images/2021-02-16/dagshub-logo.png new file mode 100644 index 0000000000..1ba97121b0 Binary files /dev/null and b/static/uploads/images/2021-02-16/dagshub-logo.png differ diff --git a/static/uploads/images/2021-02-16/feb21cover.png b/static/uploads/images/2021-02-16/feb21cover.png new file mode 100644 index 0000000000..c335a70257 Binary files /dev/null and b/static/uploads/images/2021-02-16/feb21cover.png differ diff --git a/static/uploads/images/2021-02-16/newstack_image.png b/static/uploads/images/2021-02-16/newstack_image.png new file mode 100644 index 0000000000..5f0221e73d Binary files /dev/null and b/static/uploads/images/2021-02-16/newstack_image.png differ diff --git a/static/uploads/images/2021-02-16/spacy_integration.jpg b/static/uploads/images/2021-02-16/spacy_integration.jpg new file mode 100644 index 0000000000..9811e537b7 Binary files 
/dev/null and b/static/uploads/images/2021-02-16/spacy_integration.jpg differ diff --git a/static/uploads/images/2021-02-18/dvc-2-0-pre-release.png b/static/uploads/images/2021-02-18/dvc-2-0-pre-release.png new file mode 100644 index 0000000000..e4c0d35624 Binary files /dev/null and b/static/uploads/images/2021-02-18/dvc-2-0-pre-release.png differ diff --git a/static/uploads/images/2021-02-18/dvclive-diff-html.png b/static/uploads/images/2021-02-18/dvclive-diff-html.png new file mode 100644 index 0000000000..be78742c99 Binary files /dev/null and b/static/uploads/images/2021-02-18/dvclive-diff-html.png differ diff --git a/static/uploads/images/2021-02-18/dvclive-html.png b/static/uploads/images/2021-02-18/dvclive-html.png new file mode 100644 index 0000000000..eba95dcd2f Binary files /dev/null and b/static/uploads/images/2021-02-18/dvclive-html.png differ