diff --git a/config/prismjs/dvc-commands.js b/config/prismjs/dvc-commands.js index 41edc54b68..937f9da24e 100644 --- a/config/prismjs/dvc-commands.js +++ b/config/prismjs/dvc-commands.js @@ -25,9 +25,6 @@ module.exports = [ 'plots modify', 'plots diff', 'plots', - 'pipeline show', - 'pipeline list', - 'pipeline', 'move', 'metrics show', 'metrics diff', diff --git a/content/blog/2020-06-22-dvc-1-0-release.md b/content/blog/2020-06-22-dvc-1-0-release.md index 84718de0ae..456df34194 100644 --- a/content/blog/2020-06-22-dvc-1-0-release.md +++ b/content/blog/2020-06-22-dvc-1-0-release.md @@ -2,13 +2,13 @@ title: 'DVC 1.0 release: new features for MLOps' date: 2020-06-22 description: | - Today we're releasing DVC 1.0. New exciting features that users were waiting - for ❤️ . All the details in this blog post. + Today we're releasing DVC 1.0 with new exciting features that users were + waiting for ❤️. Find all the details in this blog post. descriptionLong: | - Today we're releasing DVC 1.0. New exciting features that users were waiting - for ❤️. DVC is a more mature product now with stable release cycles and - benchmarks. All the details in this blog post. + Today we're releasing DVC 1.0. It brings new exciting features that users + were waiting for ❤️. DVC is a more mature product now, with stable release + cycles and benchmarks. Find all the details in this blog post. picture: 2020-06-22/release.png pictureComment: DVC 1.0 release @@ -24,31 +24,31 @@ tags: ## Introduction 3 years ago, I was concerned about good engineering standards in data science: -data versioning, reproducibility, workflow automation - like continuous -integration and continuous delivery (CI/CD) - but for machine learning. I wanted -there to be Git for data to make this possible. So I made DVC (Data Version -Control), which works as version control for data projects. +data versioning, reproducibility, workflow automation — like continuous +integration and continuous delivery (CI/CD), but for machine learning. I wanted +there to be a "Git for data" to make all this possible. So I created DVC (Data +Version Control), which works as version control for data projects. Technically, DVC codifies your data and machine learning pipelines as text metafiles (with pointers to actual data in S3/GCP/Azure/SSH), while you use Git -for the actual versioning. DevOps folks call this approach GitOps or more -specificaly in this case - _DataOps_ or _MLOps_. +for the actual versioning. DevOps folks call this approach GitOps or, more +specifically, in this case _DataOps_ or _MLOps_. The new DVC 1.0. is inspired by discussions and contributions from our community of data scientists, ML engineers, developers and software engineers. ## DVC 1.0 -The new DVC 1.0 is inspired by discussions and contributions from our -community - both fresh ideas and bug reports 😅. All these contributions, big -and small, have a collective impact on DVC's development - I'm confident 1.0 -wouldn't be possible without our community. They tell us what features matter -most, what approaches work (and what don't!), and what they need from DVC to -support their ML projects. +The new DVC 1.0 is inspired by discussions and contributions from our community +— both fresh ideas and bug reports 😅. All these contributions, big and small, +have a collective impact on DVC's development. I'm confident 1.0 wouldn't be +possible without our community. 
They tell us what features matter most, which +approaches work (and which don't!), and what they need from DVC to support their +ML projects. -A few weeks ago we announced the 1.0 prerelease. After lots of helpful feedback +A few weeks ago we announced the 1.0 pre-release. After lots of helpful feedback from brave users, it's time to go live. Now, DVC 1.0 is available with all the -standard installation methods including pip, conda, brew, choco, and +standard installation methods including `pip`, `conda`, `brew`, `choco`, and system-specific packages: deb, rpm, msi, pkg. See https://dvc.org/doc/install for more details. @@ -62,21 +62,21 @@ learned in 3 years of this journey and how these are reflected in the new DVC. Our users taught us that ML pipelines evolve much faster than data engineering pipelines with data processing steps. People need to change the commands of the -pipeline often and it was not easy to do with the old DVC files. +pipeline often and it was not easy to do this with the old DVC-files. -In DVC 1.0, the DVC file format was changed in three big ways. First, instead of -multiple DVC stage files (`*.dvc`), each project has a single DVC file -`dvc.yaml`. By default, all stages go in this single `.yaml` file. +In DVC 1.0, the DVC metafile format was changed in three big ways. First, +instead of multiple DVC stage files (`*.dvc`), each project has a single +`dvc.yaml` file. By default, all stages go in this single YAML file. -Second, we made clear connections between the `dvc run` command, where pipeline -stages are defined, and how stages appear in `dvc.yaml`. Many of the `dvc run` -options are mirrored in the metafile. We wanted to make it far less complicated -to edit an existing pipeline by making `dvc.yaml` more human readable and -writable. +Second, we made clear connections between the `dvc run` command (a helper to +define pipeline stages), and how stages are defined in `dvc.yaml`. Many of the +options of `dvc run` are mirrored in the metafile. We wanted to make it far less +complicated to edit an existing pipeline by making `dvc.yaml` more human +readable and writable. -Third, data hash values are no longer stored in the pipeline metafile. This -approach aligns better with GitOps paradigms and simplifies the usage of DVC by -tremendously improving metafile human-readability: +Third, file and directory hash values are no longer stored in the pipeline +metafile. This approach aligns better with the GitOps paradigms and simplifies +the usage of DVC by tremendously improving metafile human-readability: ```yaml stages: @@ -99,53 +99,53 @@ stages: - dropout metrics: - logs.csv - - summary.json + - summary.json: cache: false outs: - model.pkl ``` -All the hashes have been moved to a special file, `dvc.lock`, which is a lot -like the old DVC file format. DVC uses the `.lock` file to define what data -files need to be restored to the workspace from data remotes (cloud storage) and -if a particular pipeline stage needs to be rerun. In other words, we're -separating the human-readable parts of the pipeline into `dvc.yaml` and -auto-generated "machine" parts into `dvc.lock`. +All of the hashes have been moved to a special file, `dvc.lock`, which is a lot +like the old DVC-file format. DVC uses this lock file to define which data files +need to be restored to the workspace from data remotes (cloud storage) and if a +particular pipeline stage needs to be rerun. 
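To give a rough idea, here is a minimal, made-up sketch of what `dvc.lock` can look like (hypothetical stage name, command, and hash values — DVC generates the real contents for you):

```yaml
stages:
  train:                      # hypothetical stage name
    cmd: python train.py      # hypothetical command
    deps:
      - path: data/features   # hypothetical dependency
        md5: 20b786b6e6f80e2b3fcf17827ad18597 # made-up hash; DVC fills these in
    outs:
      - path: model.pkl
        md5: 662eb7f64216d9c2c1088d0a5e2c6951 # made-up hash
```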
In other words, we're separating +the human-readable parts of the pipeline into `dvc.yaml`, and the auto-generated +"machine" parts into `dvc.lock`. -Another cool change: the auto-generated part doesn't necessarily need to be -stored in your Git repository. The new run-cache feature eliminates the need of -storing the lock file in Git repositories. That brings us to our next big -feature: +Another cool change: the auto-generated part (`dvc.lock`) doesn't necessarily +have to be stored in your Git repository. The new run-cache feature eliminates +the need of storing the lock file in Git repositories. That brings us to our +next big feature: ### [Run cache](https://github.com/iterative/dvc/issues/1234) We built DVC with a workflow in mind: one experiment to one commit. Some users love it, but this approach gets clunky fast for others (like folks who are -grid-searching hyperparameter space). Forcing users to make Git commits for each -ML experiment was a requirement for the old DVC, if you wanted to snapshot your +grid-searching a hyperparameter space). Making Git commits for each ML +experiment was a requirement with the old DVC, if you wanted to snapshot your project or pipelines on each experiment. Moving forward, we want to give users more flexibility to decide how often they want to commit. We had an insight that data remotes (S3, Azure Blob, SSH etc) can be used instead of Git for storing the codified meta information, not only data. In DVC -1.0 a special structure is implemented - run-cache - that preserves the state -including all the hashes. Basically, all the information that is stored in the +1.0, a special structure is implemented, the run-cache, that preserves the state +(including all the hashes). Basically, all the information that is stored in the new `dvc.lock` file is replicated in the run-cache. The advantage of the run-cache is that pipeline runs (and output file versions) -are not directly connected to Git commits anymore. New DVC can store all the -runs in run-cache - even if they were never committed to Git. +are not directly connected to Git commits anymore. The new DVC can store all the +runs in the run-cache, even if they were never committed to Git. -This approach gives DVC a "long memory" of DVC stages runs. If a user runs a -command that was run before (whether Git committed or not), then DVC can return -the result of the command from the cache without rerunning it. It is a useful -feature for a hyperparameter optimization stage - when users return to the +This approach gives DVC a "long memory" of DVC stages runs. If a user tries to +run a stage that was previously run (whether committed to Git or not), then DVC +can return the result from the run-cache without rerunning it. It is a useful +feature for a hyperparameter optimization stage — when users return to the previous sets of the parameters and don't want to wait for ML retraining. Another benefit of the run-cache is related to CI/CD systems for ML, which is a holy grail of MLOps. The long memory means users don't have to make auto-commits in their CI/CD system side - see -[this stackowerflow question](https://stackoverflow.com/questions/61245284/will-you-automate-git-commit-into-ci-cd-pipline-to-save-dvc-run-experiments). +[this Stackowerflow question](https://stackoverflow.com/questions/61245284/will-you-automate-git-commit-into-ci-cd-pipline-to-save-dvc-run-experiments). 
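As a rough sketch of how this can look on the CI side (assuming a default DVC remote is already configured, and relying on the `--run-cache` flag that `dvc pull` and `dvc fetch` accept, as shown in the command reference updates below):

```dvc
$ dvc pull --run-cache   # download previously saved runs from the remote
$ dvc repro              # stages matching a saved run are restored, not re-executed
```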
### [Plots](https://github.com/iterative/dvc/issues/3409) diff --git a/content/docs/command-reference/checkout.md b/content/docs/command-reference/checkout.md index 82bd56d501..baafff3655 100644 --- a/content/docs/command-reference/checkout.md +++ b/content/docs/command-reference/checkout.md @@ -65,8 +65,8 @@ progress made by the checkout. There are two methods to restore a file missing from the cache, depending on the situation. In some cases a pipeline must be reproduced (using `dvc repro`) to -regenerate its outputs (see also `dvc pipeline`). In other cases the cache can -be pulled from remote storage using `dvc pull`. +regenerate its outputs (see also `dvc dag`). In other cases the cache can be +pulled from remote storage using `dvc pull`. ## Options diff --git a/content/docs/command-reference/dag.md b/content/docs/command-reference/dag.md new file mode 100644 index 0000000000..ff01c3f9f4 --- /dev/null +++ b/content/docs/command-reference/dag.md @@ -0,0 +1,108 @@ +# dag + +Show [stages](/doc/command-reference/run) in a pipeline that lead to the +specified stage. By default it lists +[DVC-files](/doc/user-guide/dvc-files-and-directories). + +## Synopsis + +```usage +usage: dvc dag [-h] [-q | -v] [--dot] [--full] [target] + +positional arguments: + targets Stage or output to show pipeline for (optional) + Finds all stages in the workspace by default. +``` + +## Description + +A data pipeline, in general, is a series of data processing +[stages](/doc/command-reference/run) (for example console commands that take an +input and produce an output). A pipeline may produce intermediate +data, and has a final result. Machine learning (ML) pipelines typically start a +with large raw datasets, include intermediate featurization and training stages, +and produce a final model, as well as accuracy +[metrics](/doc/command-reference/metrics). + +In DVC, pipeline stages and commands, their data I/O, interdependencies, and +results (intermediate or final) are specified with `dvc add` and `dvc run`, +among other commands. This allows DVC to restore one or more pipelines of stages +interconnected by their dependencies and outputs later. (See `dvc repro`.) + +> DVC builds a dependency graph +> ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) to do this. + +`dvc dag` displays the stages of a pipeline up to the target stage. If `target` +is omitted, it will show the full project DAG. + +## Options + +- `--dot` - show DAG in + [DOT]() + format. It can be passed to third party visualization utilities. + +- `--full` - show full DAG that the `target` belongs too, instead of showing the + part that consists only of the target ancestors. + +- `-h`, `--help` - prints the usage/help message, and exit. + +- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no + problems arise, otherwise 1. + +- `-v`, `--verbose` - displays detailed tracing information. + +## Paging the output + +This command's output is automatically piped to +[Less](), if available in the +terminal. (The exact command used is `less --chop-long-lines --clear-screen`.) +If `less` is not available (e.g. on Windows), the output is simply printed out. + +> It's also possible to +> [enable Less paging on Windows](/doc/user-guide/running-dvc-on-windows#enabling-paging-with-less). + +### Providing a custom pager + +It's possible to override the default pager via the `DVC_PAGER` environment +variable. 
For example, the following command will replace the default pager with +[`more`](), for a single run: + +```dvc +$ DVC_PAGER=more dvc dag +``` + +For a persistent change, define `DVC_PAGER` in the shell configuration. For +example in Bash, we could add the following line to `~/.bashrc`: + +```bash +export DVC_PAGER=more +``` + +## Examples + +Visualize DVC pipeline: + +```dvc +$ dvc dag + +---------+ + | prepare | + +---------+ + * + * + * + +-----------+ + | featurize | + +-----------+ + ** ** + ** * + * ** ++-------+ * +| train | ** ++-------+ * + ** ** + ** ** + * * + +----------+ + | evaluate | + +----------+ +``` diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index ff51fdc807..ac5ea9b2b5 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -6,10 +6,8 @@ Get tracked files or directories from ## Synopsis ```usage -usage: dvc fetch [-h] [-q | -v] [-j ] - [-r ] [-a] [-T] - [--all-commits] [-d] [-R] - [--run-cache] +usage: dvc fetch [-h] [-q | -v] [-j ] [-r ] [-a] [-T] + [--all-commits] [-d] [-R] [--run-cache] [targets [targets ...]] positional arguments: diff --git a/content/docs/command-reference/freeze.md b/content/docs/command-reference/freeze.md index 006a8686ff..0ab6b75fc5 100644 --- a/content/docs/command-reference/freeze.md +++ b/content/docs/command-reference/freeze.md @@ -1,7 +1,7 @@ # freeze -Freeze a [stage](/doc/command-reference/run). Use `dvc unfreeze` to unfreeze the -stage. +Freeze [stages](/doc/command-reference/run) until `dvc unfreeze` is used on +them. Frozen stages are never executed by `dvc repro`. ## Synopsis @@ -9,24 +9,24 @@ stage. usage: dvc freeze [-h] [-q | -v] targets [targets ...] positional arguments: - targets stages to freeze. + targets Stages or .dvc files to freeze ``` ## Description -`dvc freeze` causes any stage to be considered _not changed_ by `dvc status` and -`dvc repro`. Stage reproduction will not regenerate outputs of -frozen stages, even if some dependencies have changed, and even if `--force` is -provided. +`dvc freeze` causes the [stages](/doc/command-reference/run) indicated as +`targets` to be considered _not changed_ by `dvc status` and `dvc repro`. Stage +reproduction will not regenerate outputs of frozen stages, even if +their dependencies have changed, and even if `--force` is used. Freezing a stage is useful to avoid syncing data from the top of its [pipeline](/doc/command-reference/pipeline), and keep iterating on the last -(unfrozen) stages only. +(non-frozen) stages only. -Note that import stages are considered always frozen. Use -`dvc update` to update the corresponding data artifacts from the -external data source. [Unfreeze](/doc/command-reference/unfreeze) them before -using `dvc repro` on a pipeline that needs their outputs. +Note that import stages are frozen by default. Use `dvc update` to +update the corresponding data artifacts from the external data +source. [Unfreeze](/doc/command-reference/unfreeze) them before using +`dvc repro` on a pipeline that needs their outputs. ## Options @@ -39,56 +39,42 @@ using `dvc repro` on a pipeline that needs their outputs. ## Examples -First, let's create a simple DVC-file: +First, let's create a dummy stage that copies `foo` to `bar`: ```dvc $ echo foo > foo $ dvc add foo -Saving information ... - -$ dvc run -d foo -o bar cp foo bar -Running command: - cp foo bar -... 
+$ dvc run -n make_copy -d foo -o bar cp foo bar ``` -Then, let's change the file `foo` that the stage described in `bar.dvc` depends -on: +> See `dvc run` for more details. + +Then, let's change the file `foo` that the stage `make_copy` depends on: ```dvc -$ rm foo -$ echo foo1 > foo +$ echo zoo > foo $ dvc status - -bar.dvc - deps - changed: foo -foo.dvc - outs - changed: foo +make_copy: + changed deps: + modified: foo +foo.dvc: + changed outs: + modified: foo ``` -Now, let's freeze the `bar` stage: +`dvc status` notices that `foo` has changed. Let's now freeze the `make_copy` +stage and see what's the project status after that: ```dvc -$ dvc freeze bar.dvc +$ dvc freeze make_copy $ dvc status - - foo.dvc - outs - changed: foo +foo.dvc: + changed outs: + modified: foo ``` -Run `dvc unfreeze` to unfreeze it back: +DVC notices that `foo` changed due to the `foo.dvc` file that tracks this file +(as `outs`), but the `make_copy` stage no longer records the change among it's +`deps`. -```dvc -$ dvc unfreeze bar.dvc -$ dvc status - - bar.dvc - deps - changed: foo - foo.dvc - outs - changed: foo -``` +> You can use `dvc unfreeze` to go back to the regular project status. diff --git a/content/docs/command-reference/gc.md b/content/docs/command-reference/gc.md index f60584954b..130b20b446 100644 --- a/content/docs/command-reference/gc.md +++ b/content/docs/command-reference/gc.md @@ -20,7 +20,7 @@ DVC cache but are no longer needed. With `--cloud` it also removes data in To avoid accidentally deleting data, it raises an error and doesn't touch any files if no scope options are provided. It means it's user's responsibility to explicitly provide the right set of options to specify what data is still needed -(so that DVC can figure out what fils can be safely deleted). +(so that DVC can figure out what files can be safely deleted). One of the scope options (`--workspace`, `--all-branches`, `--all-tags`, `--all-commits`) or a combination of them must be provided. Each of them @@ -28,6 +28,13 @@ corresponds to keeping the data for the current workspace, and for a certain set of commits (determined by reading the DVC-files in them). See the [Options](#options) section for more details. +> Note that `dvc gc` tries to fetch any missing +> [`.dir` files](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +> from [remote storage](/doc/command-reference/remote) to the local +> cache, in order to know which files should exist inside cached +> directories. These files may be missing if the cache directory was previously +> garbage collected, in a newly cloned copy of the repo, etc. + Unless the `--cloud` option is used, `dvc gc` does not remove data files from any remote. This means that any files collected from the local cache can be restored using `dvc fetch`, as long as they have previously been uploaded with diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index b344fa8d59..8380f9508c 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -121,9 +121,9 @@ a regular stage (in the imported data (`out`). - `--no-exec` - create `.dvc` file without actually downloading `url`. E.g. if - file or directory already exists `--no-exec` can be used to skip download. In - this case, `dvc commit .dvc` should be used to calculate URL and data - hash, update generated .`dvc` file and save existing data into the DVC cache. 
+ the file or directory already exist it can be used to skip download. + `dvc commit .dvc` should be used to calculate the URL and data hash, + update the .`dvc` files, and save existing data to the cache. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 4965558e99..54cbf603b4 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -61,9 +61,8 @@ sub-projects to mitigate the issues of initializing in the Git repository root: download files and directories, to reproduce pipelines, etc. It can be expensive in the large repositories with a lot of projects. -- Not enough isolation/granularity - commands like `dvc metrics diff`, - `dvc pipeline show` and others by default dump all the metrics, all the - pipelines, etc. +- Not enough isolation/granularity - commands like `dvc metrics diff`, `dvc dag` + and others by default dump all the metrics, all the pipelines, etc. #### How does it affect DVC commands? diff --git a/content/docs/command-reference/pipeline/index.md b/content/docs/command-reference/pipeline/index.md deleted file mode 100644 index b9cf3eb4ae..0000000000 --- a/content/docs/command-reference/pipeline/index.md +++ /dev/null @@ -1,47 +0,0 @@ -# pipeline - -A set of commands to manage -[pipelines](/doc/tutorials/get-started/data-pipelines): -[show](/doc/command-reference/pipeline/show) and -[list](/doc/command-reference/pipeline/list). - -## Synopsis - -```usage -usage: dvc pipeline [-h] [-q | -v] {show,list} ... - -positional arguments: - COMMAND - show Show pipeline. - list List pipelines. -``` - -## Description - -A data pipeline, in general, is a series of data processing -[stages](/doc/command-reference/run) (for example console commands that take an -input and produce an output). A pipeline may produce intermediate -data, and has a final result. Machine learning (ML) pipelines typically start a -with large raw datasets, include intermediate featurization and training stages, -and produce a final model, as well as accuracy -[metrics](/doc/command-reference/metrics). - -In DVC, pipeline stages and commands, their data I/O, interdependencies, and -results (intermediate or final) are specified with `dvc add` and `dvc run`, -among other commands. This allows DVC to restore one or more pipelines of stages -interconnected by their dependencies and outputs later. (See `dvc repro`.) - -> DVC builds a dependency graph -> ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) to do this. - -`dvc pipeline` commands help users display the existing project pipelines in -different ways. - -## Options - -- `-h`, `--help` - prints the usage/help message, and exit. - -- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no - problems arise, otherwise 1. - -- `-v`, `--verbose` - displays detailed tracing information. diff --git a/content/docs/command-reference/pipeline/list.md b/content/docs/command-reference/pipeline/list.md deleted file mode 100644 index 3ee8cfdb2b..0000000000 --- a/content/docs/command-reference/pipeline/list.md +++ /dev/null @@ -1,41 +0,0 @@ -# pipeline list - -List connected groups of [stages](/doc/command-reference/run) (pipelines). - -## Synopsis - -```usage -usage: dvc pipeline list [-h] [-q | -v] -``` - -## Description - -Displays a list of all existing stages in the project, grouped in -their corresponding [pipeline](/doc/command-reference/pipeline), when connected. 
- -> Note that the stages in these lists are in ascending order, that is, from last -> to first. - -## Options - -- `-h`, `--help` - prints the usage/help message, and exit. - -- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no - problems arise, otherwise 1. - -- `-v`, `--verbose` - displays detailed tracing information. - -## Examples - -List available pipelines: - -```dvc -$ dvc pipeline list -Dvcfile -====================================================================== -raw.dvc -data.dvc -output.dvc -====================================================================== -2 pipelines total -``` diff --git a/content/docs/command-reference/pipeline/show.md b/content/docs/command-reference/pipeline/show.md deleted file mode 100644 index 57848ead1e..0000000000 --- a/content/docs/command-reference/pipeline/show.md +++ /dev/null @@ -1,156 +0,0 @@ -# pipeline show - -Show [stages](/doc/command-reference/run) in a pipeline that lead to the -specified stage. By default it lists -[DVC-files](/doc/user-guide/dvc-files-and-directories). - -## Synopsis - -```usage -usage: dvc pipeline show [-h] [-q | -v] [-c | -o] [-l] [--ascii] - [--dot] [--tree] - [targets [targets ...]] - -positional arguments: - targets DVC-files to show pipeline for. Optional. - (Finds all DVC-files in the workspace by default.) -``` - -## Description - -`dvc show` displays the stages of a pipeline up to one or more target DVC-files -(stage files). All stages are shown unless specific `targets` are specified. The -`-c` and `-o` options allow to list the corresponding commands or data file flow -instead of stages. - -> Note that the stages in these lists are in descending order, that is, from -> first to last. - -## Options - -- `-c`, `--commands` - show pipeline as a list (diagram if `--ascii` or `--dot` - is used) of commands instead of paths to DVC-files. - -- `-o`, `--outs` - show pipeline as a list (diagram if `--ascii` or `--dot` is - used) of stage outputs instead of paths to DVC-files. - -- `--ascii` - visualize pipeline. It will print a graph (ASCII) instead of a - list of path to DVC-files. (`less` pager may be used, see - [Paging the output](#paging-the-output) below for details). - -- `--dot` - show contents of `.dot` files with a DVC pipeline graph. It can be - passed to third party visualization utilities. - -- `--tree` - list dependencies tree like recursive directory listing. - -- `-l`, `--locked` - print frozen stages only. See `dvc freeze`. - -- `-h`, `--help` - prints the usage/help message, and exit. - -- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no - problems arise, otherwise 1. - -- `-v`, `--verbose` - displays detailed tracing information. - -## Paging the output - -This command's output is automatically piped to -[Less](), if available in the -terminal. (The exact command used is `less --chop-long-lines --clear-screen`.) -If `less` is not available (e.g. on Windows), the output is simply printed out. - -> It's also possible to -> [enable Less paging on Windows](/doc/user-guide/running-dvc-on-windows#enabling-paging-with-less). - -### Providing a custom pager - -It's possible to override the default pager via the `DVC_PAGER` environment -variable. For example, the following command will replace the default pager with -[`more`](), for a single run: - -```bash -$ DVC_PAGER=more dvc pipeline show --ascii my-pipeline.dvc -``` - -For a persistent change, define `DVC_PAGER` in the shell configuration. 
For -example in Bash, we could add the following line to `~/.bashrc`: - -```bash -export DVC_PAGER=more -``` - -## Examples - -Default mode: show stage files that `output.dvc` recursively depends on: - -```dvc -$ dvc pipeline show output.dvc -raw.dvc -data.dvc -output.dvc -``` - -The same as previous, but show commands instead of DVC-files: - -```dvc -$ dvc pipeline show output.dvc --commands -download.py s3://mybucket/myrawdata raw -cleanup.py raw data -process.py data output -``` - -Visualize DVC pipeline: - -```dvc -$ dvc pipeline show eval.txt.dvc --ascii - .------------------------. - | data/Posts.xml.zip.dvc | - `------------------------' - * - * - * - .---------------. - | Posts.xml.dvc | - `---------------' - * - * - * - .---------------. - | Posts.tsv.dvc | - `---------------' - * - * - * - .---------------------. - | Posts-train.tsv.dvc | - `---------------------' - * - * - * - .--------------------. - | matrix-train.p.dvc | - `--------------------' - *** *** - ** *** - ** ** -.-------------. ** -| model.p.dvc | ** -`-------------' *** - *** *** - ** ** - ** ** - .--------------. - | eval.txt.dvc | - `--------------' -``` - -List dependencies recursively if the graph has a tree structure: - -```dvc -$ dvc pipeline show e.file.dvc --tree -e.file.dvc -├── c.file.dvc -│ └── b.file.dvc -│ └── a.file.dvc -└── d.file.dvc -``` diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index f43a24d976..17a6d66f20 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -9,8 +9,8 @@ Download tracked files or directories from ## Synopsis ```usage -usage: dvc pull [-h] [-q | -v] [-j ] - [-r ] [-a] [-T] [-d] [-f] [-R] [--all-commits] +usage: dvc pull [-h] [-q | -v] [-j ] [-r ] [-a] [-T] + [-d] [-f] [-R] [--all-commits] [--run-cache] [targets [targets ...]] positional arguments: diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 5cf75dbde2..ea924c7a82 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -129,8 +129,8 @@ only execute the final stage. The stage is only executed if the user types "y". - `-p`, `--pipeline` - reproduce the entire pipelines that the stage file - `targets` belong to. Use `dvc pipeline show .dvc` to show the parent - pipeline of a target stage. + `targets` belong to. Use `dvc dag ` to show the parent pipeline of a + target stage. - `-P`, `--all-pipelines` - reproduce all pipelines, for all the stage files present in `DVC` repository. diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 41a25f29bc..f81780bd81 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -197,9 +197,6 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR' up to the user to save and version control it. See also the difference between `-o` and `-O`. -- `--external` - allow outputs that are outside of the DVC repository. See - [Managing External Data](/doc/user-guide/managing-external-data). - - `-w `, `--wdir ` - specifies a working directory for the `command` to run in (uses the `wdir` field in `dvc.yaml`). Dependency and output files (including metrics and plots) should be specified relative to this directory. @@ -251,37 +248,46 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR' ## Examples -Let's see a few introductory examples to try the different command options and -see what they do. 
- -Create a DVC project and a stage that writes a JSON file as output: +Let's create a DVC project and a stage (that counts the number of +lines in a `test.txt` file): ```dvc $ mkdir example && cd example $ git init $ dvc init $ mkdir data -$ dvc run --name json_struct -d data -o struct.json \ - "echo '{\"a_number\": 0.75}' > struct.json" -Running stage 'json_struct' with command: - echo '{"a_number": 0.75}' > struct.json +$ dvc run -n count \ + -d test.txt \ + -o lines \ + "cat test.txt | wc -l > lines" +Running stage 'test' with command: + cat test.txt | wc -l > lines Creating 'dvc.yaml' -Adding stage 'json_struct' in 'dvc.yaml' +Adding stage 'count' in 'dvc.yaml' Generating lock file 'dvc.lock' + +$ tree +. +├── dvc.lock +├── dvc.yaml +├── lines +└── test.txt ``` This results in the following stage entry in `dvc.yaml`: ```yaml stages: - json_struct: - cmd: 'echo ''{"a_number": 0.75}'' > struct.json' + count: + cmd: 'cat test.txt | wc -l > lines' deps: - - data + - test.txt outs: - - struct.json + - lines ``` +## Example: Overwrite an existing stage + The following stage runs a Python script that trains an ML model on the training dataset (`20180226` is a seed value): @@ -300,22 +306,29 @@ $ dvc run -n train -f -d matrix-train.p -d train_model.py -o model.p \ python train_model.py matrix-train.p 18494003 model.p ``` -Move to a subdirectory and create a stage there. This generates a separate +## Example: Separate stages in a subdirectory + +Let's move to a subdirectory and create a stage there. This generates a separate `dvc.yaml` file in that location. The stage command itself counts the lines in -`test.txt` and writes the number to `result.out`. +`test.txt` and writes the number to `lines`. ```dvc -$ cd stages/ -$ dvc run -n test \ - -d test.txt \ +$ cd more_stages/ +$ dvc run -n process_data \ + -d data.in \ -o result.out \ - "cat test.txt | wc -l > result.out" -$ tree + ./my_script.sh data.in result.out +$ tree .. . -├── dvc.lock ├── dvc.yaml -├── result.out -└── test.txt +├── dvc.lock +├── file1 +├── ... +└── more_stages/ + ├── data.in + ├── dvc.lock + ├── dvc.yaml + └── result.out ``` ## Example: Chaining stages diff --git a/content/docs/command-reference/unfreeze.md b/content/docs/command-reference/unfreeze.md index f169973887..d413a231c4 100644 --- a/content/docs/command-reference/unfreeze.md +++ b/content/docs/command-reference/unfreeze.md @@ -1,7 +1,7 @@ # unfreeze -Unfreeze [stage](/doc/command-reference/run). See `dvc freeze` for more -information. +Unfreeze [stages](/doc/command-reference/run) so that `dvc repro` can execute +them. See `dvc freeze` for more information. ## Synopsis @@ -9,20 +9,23 @@ information. usage: dvc unfreeze [-h] [-q | -v] targets [targets ...] positional arguments: - targets stages to unfreeze. + targets Stages or .dvc files to unfreeze + (see also `dvc freeze`). ``` ## Description -There are several reasons that can produce data files to be frozen in a DVC -project, `dvc freeze` being the most obvious one. +There are several ways that tracked data files can be frozen, `dvc freeze` being +one of them. Frozen stages are considered _not changed_ by `dvc status` and +`dvc repro`. -If `dvc unfreeze` is used on frozen stages, they will start to be checked by -`dvc status`, and updated by `dvc repro`. +If `dvc unfreeze` is used on frozen stages, they will start being checked again +by `dvc status`, and regenerated by `dvc repro`. -Note that import stages are considered always frozen. They can not -be unfrozen. 
Use `dvc update` on them to update the file, directory, or -data artifact from its external data source. +Note that import stages are frozen by default. Use `dvc update` to +update the corresponding data artifacts from the external data +source. [Unfreeze](/doc/command-reference/unfreeze) them before using +`dvc repro` on a pipeline that needs their outputs. ## Options @@ -35,56 +38,39 @@ be unfrozen. Use `dvc update` on them to update the file, directory, or ## Examples -First, let's create a simple DVC-file: +First, let's create a dummy stage that copies `foo` to `bar`: ```dvc $ echo foo > foo $ dvc add foo -Saving information ... - -$ dvc run -d foo -o bar cp foo bar -Running command: - cp foo bar -... +$ dvc run -n make_copy -d foo -o bar cp foo bar ``` -Then, let's change the file `foo` that the stage described in `bar.dvc` depends -on: - -```dvc -$ rm foo -$ echo foo1 > foo -$ dvc status - -bar.dvc - deps - changed: foo -foo.dvc - outs - changed: foo -``` +> See `dvc run` for more details. -Now, let's freeze the `bar` stage: +Then, let's change the file `foo` that the stage `make_copy` depends on, and +freeze stage as well, to see what's the project status after that: ```dvc -$ dvc freeze bar.dvc +$ echo zoo > foo +$ dvc freeze make_copy $ dvc status - - foo.dvc - outs - changed: foo +foo.dvc: + changed outs: + modified: foo ``` -Run `dvc unfreeze` to unfreeze it back: +DVC notices that `foo` changed due to the `foo.dvc` file that tracks this file +(as `outs`), but the `make_copy` stage doesn't records the change among it's +dependencies. Run `dvc unfreeze` to get the regular/full project status: ```dvc -$ dvc unfreeze bar.dvc +$ dvc unfreeze make_copy $ dvc status - - bar.dvc - deps - changed: foo - foo.dvc - outs - changed: foo +make_copy: + changed deps: + modified: foo +foo.dvc: + changed outs: + modified: foo ``` diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index ff700adeb2..8ed98b7b81 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -34,10 +34,30 @@ "katacoda": "https://katacoda.com/dvc/courses/get-started" }, "children": [ - "data-versioning", - "data-access", - "data-pipelines", - "experiments" + { + "slug": "data-versioning", + "tutorials": { + "katacoda": "https://katacoda.com/dvc/courses/get-started" + } + }, + { + "slug": "data-access", + "tutorials": { + "katacoda": "https://katacoda.com/dvc/courses/get-started" + } + }, + { + "slug": "data-pipelines", + "tutorials": { + "katacoda": "https://katacoda.com/dvc/courses/get-started" + } + }, + { + "slug": "experiments", + "tutorials": { + "katacoda": "https://katacoda.com/dvc/courses/get-started" + } + } ] }, { @@ -155,6 +175,10 @@ "label": "config", "slug": "config" }, + { + "label": "dag", + "slug": "dag" + }, { "label": "destroy", "slug": "destroy" @@ -236,21 +260,6 @@ } ] }, - { - "label": "pipeline", - "slug": "pipeline", - "source": "pipeline/index.md", - "children": [ - { - "label": "pipeline list", - "slug": "list" - }, - { - "label": "pipeline show", - "slug": "show" - } - ] - }, { "label": "plots", "slug": "plots", diff --git a/content/docs/start/data-access.md b/content/docs/start/data-access.md index 1b240ff3a6..9376c95b25 100644 --- a/content/docs/start/data-access.md +++ b/content/docs/start/data-access.md @@ -10,12 +10,11 @@ specific version of a model? How do I reuse datasets across different projects? > `s3://dvc-public/remote/get-started/fb/89904ef053f04d64eafcc3d70db673` 😱 > instead of the original files, name such as `model.pkl` or `data.xml`. 
-Remember those `.dvc` files `dvc add` generates? Or the `dvc.yaml` and -`dvc.lock` pipeline files produced by `dvc run`? Those files, their history in -Git, DVC remote storage config saved in Git contain all the information needed -to access and download any version of datasets, files, and models. It means that -Git repository with DVC files becomes an entry point and can be used instead of -accessing files directly. +Remember those `.dvc` files `dvc add` generates? Those files (and `dvc.lock` +that we'll cover later), their history in Git, DVC remote storage config saved +in Git contain all the information needed to access and download any version of +datasets, files, and models. It means that Git repository with DVC files becomes +an entry point and can be used instead of accessing files directly. ## Find a file or directory @@ -57,7 +56,7 @@ file that can be saved in the project: ```dvc $ dvc import https://github.com/iterative/dataset-registry \ - get-started/data/data.xml -o data/data.xml + get-started/data.xml -o data/data.xml ``` This is similar to `dvc get` + `dvc add`, but the resulting @@ -103,7 +102,7 @@ directly from within an application at runtime. For example: import dvc.api with dvc.api.open( - 'get-started/data/data.xml', + 'get-started/data.xml', repo='https://github.com/iterative/dataset-registry' ) as fd: # ... fd is a file descriptor that can be processed normally. diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md index 49fe7c4dd1..9a530a97de 100644 --- a/content/docs/start/data-pipelines.md +++ b/content/docs/start/data-pipelines.md @@ -215,7 +215,7 @@ $ dvc repro
-### ⚙️ Expand to make some fun with it. +### ⚙️ Expand to have some fun with it Let's try to play a little bit with it. First, let's try to change one of the parameters for the training stage: diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 0ce96fca48..931aa9a254 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -5,10 +5,10 @@ directory (`.dvc/`) with the [internal directories and files](#internal-directories-and-files) needed for DVC operation. -Additionally, there are two special kind of files created by certain +Additionally, there are a few special kind of files created by certain [DVC commands](/doc/command-reference): -- Files ending with the `.dvc` extension are placeholders to version data files +- Files ending with the `.dvc` extension are placeholders to track data files and directories. A DVC project usually has one [`.dvc` file](#dvc-files) per large data file or dataset directory being tracked. @@ -16,9 +16,15 @@ Additionally, there are two special kind of files created by certain that form the pipeline(s) of a project, and their connections (_dependency graph_ or DAG). -Both use human-friendly YAML schemas, described below. We encourage you to get -familiar with them so you may edit them freely, as needed. Both type of files -should be versioned with Git (for Git-enabled repositories). + These typically come with a matching [`dvc.lock` file](#dvclock-file) to + record the pipeline state and track its data artifacts. + +Both `.dvc` files and `dvc.yaml` use human-friendly YAML schemas, described +below. We encourage you to get familiar with them so you may create, generate, +and edit them on your own. + +All these should be versioned with Git (in Git-enabled +repositories). ## .dvc files @@ -49,8 +55,9 @@ meta: - `deps`: List of dependency entries for this stage, only present when `dvc import` and `dvc import-url` are used. Typically there is only one (but several can be added manually). -- `wdir` (optional): Working directory for the stage command to run in. If this - field is not present explicitly, its value defaults to the file's location. +- `wdir` (optional): Working directory for the stage command to run in (relative + to the file's location). If this field is not present explicitly, it defaults + to `.` (the file's location). - `meta` (optional): Arbitrary metadata can be added manually with this field. Any YAML contents is supported. `meta` contents are ignored by DVC, but they can be meaningful for user processes that read `.dvc` files. @@ -58,13 +65,15 @@ meta: An _output entry_ can consist of these fields: - `md5`: Hash value for the file or directory being tracked with DVC -- `path`: Path to the file or directory (relative to `wdir`) -- `cache`: Whether or not DVC should cache the file or directory. `true` by - default +- `path`: Path to the file or directory (relative to `wdir` which defaults to + the file's location) +- `cache`: Whether or not this file or directory is cached (`true` + by default, if not present). See the `--no-commit` option of `dvc add`. 
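For illustration, a minimal output entry that disables caching might look like this inside a `.dvc` file (the hash value below is made up):

```yaml
outs:
  - md5: 3863d0e317dee0a55c4e59d2ec0eef33 # made-up hash of the tracked file
    path: logs.csv
    cache: false # keep this output out of the DVC cache
```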
A _dependency entry_ consists of a these possible fields: -- `path`: Path to the dependency (relative to `wdir`) +- `path`: Path to the dependency (relative to `wdir` which defaults to the + file's location) - `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) - `etag`: Strong ETag response header (only HTTP external dependencies created with `dvc import-url`) @@ -86,8 +95,16 @@ and `dvc commit` commands, but not when a `.dvc` file is overwritten by ## dvc.yaml file -When you add commands to a pipeline with `dvc run`, the `dvc.yaml` file is -created or updated. Here's a simple example: +`dvc.yaml` files describe data pipelines, similar to how +[Makefiles](https://www.gnu.org/software/make/manual/make.html#Introduction) +work for building software. Its YAML structure contains a list of stages which +can be written manually or generated by user code. + +> A helper command, `dvc run`, is also available to add or update stages in +> `dvc.yaml`. Additionally, a [`dvc.lock`](#dvclock-file) file is also created +> or updated by `dvc run` and `dvc repro`, to record the pipeline state. + +Here's a comprehensive `dvc.yaml` example: ```yaml stages: @@ -121,19 +138,27 @@ by the user with the `--name` (`-n`) option of `dvc run`. Each stage can contain the possible following fields: - `cmd` (always present): Executable command defined in this stage +- `wdir`: Working directory for the stage command to run in (relative to the + file's location). If this field is not present explicitly, it defaults to `.` + (the file's location). - `deps`: List of dependency file or directory paths of this stage - (relative to `wdir`) + (relative to `wdir` which defaults to the file's location) - `params`: List of [parameter dependencies](/doc/command-reference/params). These are key paths referring to a YAML or JSON file (`params.yaml` by default). - `outs`: List of output file or directory paths of this stage - (relative to `wdir`) -- `metrics`: List of [metric files](/doc/command-reference/metrics) -- `plots`: List of [plot metrics](/doc/command-reference/plots) and optionally, + (relative to `wdir` which defaults to the file's location), and optionally, + whether or not this file or directory is cached (`true` by + default, if not present). See the `--no-commit` option of `dvc run`. +- `metrics`: List of [metrics files](/doc/command-reference/metrics), and + optionally, whether or not this metrics file is cached (`true` by + default, if not present). See the `--metrics-no-cache` (`-M`) option of + `dvc run`. +- `plots`: List of [plot metrics](/doc/command-reference/plots), and optionally, their default configuration (subfields matching the options of - `dvc plots modify`) -- `wdir`: Working directory for the stage command to run in. If this field is - not present explicitly, its value defaults to the file's location. + `dvc plots modify`), and whether or not this plots file is cached + ( `true` by default, if not present). See the `--plots-no-cache` option of + `dvc run`. - `frozen`: Whether or not this stage is frozen from reproduction - `always_changed`: Whether or not this stage is considered as changed by commands such as `dvc status` and `dvc repro`. `false` by default @@ -143,6 +168,56 @@ the possible following fields: `dvc.yaml` files also support `# comments`. +### dvc.lock file + +For every `dvc.yaml` file, a matching `dvc.lock` (YAML) file usually exists. +It's created or updated by DVC commands such as `dvc run` and `dvc repro`. +`dvc.lock` describes the latest pipeline state. 
It has several purposes: + +- Tracking of intermediate and final results of a pipeline — similar to + [`.dvc` files](#dvc-files). +- Allow DVC to detect when stage definitions, or their dependencies have + changed. Such conditions invalidate states, requiring their reproduction (see + `dvc status`, `dvc repro`). +- `dvc.lock` is needed internally for several DVC commands to operate, such as + `dvc checkout`, `dvc get`, and `dvc import`. + +Here's an example `dvc.lock` based on the one in [`dvc.yaml`](#dvcyaml-file) +above: + +```yaml +stages: + features: + cmd: jupyter nbconvert --execute featurize.ipynb + deps: + - path: data/clean + md5: d8b874c5fa18c32b2d67f73606a1be60 + params: + params.yaml: + levels.no: 5 + outs: + - path: features + md5: 2119f7661d49546288b73b5730d76485 + - path: performance.json + md5: ea46c1139d771bfeba7942d1fbb5981e + - path: logs.csv + md5: f99aac37e383b422adc76f5f1fb45004 +``` + +Stage commands are listed again in `dvc.lock`, in order to know when their +definitions change in the `dvc.yaml` file. + +Regular dependencies and all kinds of outputs +(including [metrics](/doc/command-reference/metrics) and +[plots](/doc/command-reference/plots) files) are also listed (per stage) in +`dvc.lock`, but with an additional field to store the hash value of each file or +directory tracked by DVC: `md5` for local file system dependencies and SSH +external dependencies, `etag` for HTTP, S3, Azure, and Google +Cloud, and `checksum` for HDFS. + +[Parameter](/doc/command-reference/params#examples) key/value pairs are listed +separately under `params`, grouped by parameters file. + ## Internal directories and files - `.dvc/config`: This is a configuration file. The config file can be edited by diff --git a/content/docs/user-guide/running-dvc-on-windows.md b/content/docs/user-guide/running-dvc-on-windows.md index e658eee649..1d4e216a08 100644 --- a/content/docs/user-guide/running-dvc-on-windows.md +++ b/content/docs/user-guide/running-dvc-on-windows.md @@ -70,8 +70,8 @@ directory, as explained in ## Enabling paging with `less` By default, DVC tries to use [Less]() -as pager for the output of `dvc pipeline show`. Windows doesn't have the `less` -command available however. Fortunately, there is a easy way of installing it via +as pager for the output of `dvc dag`. Windows doesn't have the less command +available however. Fortunately, there is a easy way of installing `less` via [Chocolatey](https://chocolatey.org/) (please install the tool first): ```dvc diff --git a/redirects-list.json b/redirects-list.json index 391cbf27e5..9285fed991 100644 --- a/redirects-list.json +++ b/redirects-list.json @@ -33,6 +33,9 @@ "^/doc/command-reference/plot$ /doc/command-reference/plots", "^/doc/command-reference/lock$ /doc/command-reference/freeze", "^/doc/command-reference/unlock$ /doc/command-reference/unfreeze", + "^/doc/command-reference/pipeline$ /doc/command-reference/dag", + "^/doc/command-reference/pipeline/show$ /doc/command-reference/dag", + "^/doc/command-reference/pipeline/list$ /doc/command-reference/dag", "^/(.+)/$ /$1" ]