diff --git a/content/docs/api-reference/open.md b/content/docs/api-reference/open.md index 772e3ae0cf..cddae26836 100644 --- a/content/docs/api-reference/open.md +++ b/content/docs/api-reference/open.md @@ -108,10 +108,10 @@ with dvc.api.open( Notice that we use a [SAX](http://www.saxproject.org/) XML parser here because `dvc.api.open()` is able to stream the data from -[remote storage](/doc/command-reference/remote/add#supported-storage-types). -(The `mySAXHandler` object should handle the event-driven parsing of the -document in this case.) This increases the performance of the code (minimizing -memory usage), and is typically faster than loading the whole data into memory. +[remote storage](/doc/command-reference/remote/add). (The `mySAXHandler` object +should handle the event-driven parsing of the document in this case.) This +increases the performance of the code (minimizing memory usage), and is +typically faster than loading the whole data into memory. > If you just needed to load the complete file contents into memory, you can use > `dvc.api.read()` instead: diff --git a/content/docs/command-reference/cache/index.md b/content/docs/command-reference/cache/index.md index 04999a469a..d8e565a806 100644 --- a/content/docs/command-reference/cache/index.md +++ b/content/docs/command-reference/cache/index.md @@ -15,9 +15,9 @@ positional arguments: ## Description -At DVC initialization, a new `.dvc/` directory will be created for internal -configuration and cache -[files and directories](/doc/user-guide/dvc-files-and-directories) that are +At DVC initialization, a new `.dvc/` directory is created for internal +configuration and cache +[files and directories](/doc/user-guide/dvc-files-and-directories) that are hidden from the user. The cache is where your data files, models, etc.
(anything you want to version diff --git a/content/docs/command-reference/diff.md b/content/docs/command-reference/diff.md index e1515c56bd..155c6a7ae4 100644 --- a/content/docs/command-reference/diff.md +++ b/content/docs/command-reference/diff.md @@ -107,7 +107,8 @@ $ dvc diff Let's checkout the [3-add-file](https://github.com/iterative/example-get-started/releases/tag/3-add-file) -tag, corresponding to the [Add Files](/doc/tutorials/get-started/add-files) _Get +tag, corresponding to the +[tracking data](/doc/tutorials/get-started/data-versioning#tracking-data) _Get Started_ chapter, right after we added `data.xml` file with DVC: ```dvc diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index a4d257551e..7e63436ad3 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -47,16 +47,16 @@ in the project's cache. (Refer to `dvc remote` for more information on DVC remotes.) These necessary data or model files are listed as dependencies or outputs in a DVC-file (target [stage](/doc/command-reference/run)) so they are required to -[reproduce](/doc/tutorials/get-started/reproduce) the corresponding -[pipeline](/doc/command-reference/pipeline). (See +[reproduce](/doc/tutorials/get-started/data-pipelines#reproduce) the +corresponding [pipeline](/doc/command-reference/pipeline). (See [DVC-File Format](/doc/user-guide/dvc-file-format) for more information on dependencies and outputs.) `dvc fetch` ensures that the files needed for a DVC-file to be -[reproduced](/doc/tutorials/get-started/reproduce) exist in cache. If no -`targets` are specified, the set of data files to fetch is determined by -analyzing all DVC-files in the current branch, unless `--all-branches` or -`--all-tags` is specified. +[reproduced](/doc/tutorials/get-started/data-pipelines#reproduce) exist in +cache. 
If no `targets` are specified, the set of data files to fetch is +determined by analyzing all DVC-files in the current branch, unless +`--all-branches` or `--all-tags` is specified. The default remote is used (see `dvc config core.remote`) unless the `--remote` option is used. @@ -64,10 +64,10 @@ option is used. `dvc fetch`, `dvc pull`, and `dvc push` are related in that these 3 commands perform data synchronization among local and remote storage. The specific way in which the set of files to push/fetch/pull is determined begins with calculating -file hashes when these are [added](/doc/tutorials/get-started/add-files) with -DVC. File hashes are stored in the corresponding DVC-files (typically versioned -with Git). Only the hashes specified in DVC-files currently in the workspace are -considered by `dvc fetch` (unless the `-a` or `-T` options are used). +file hashes when these are [added](/doc/command-reference/add) with DVC. File +hashes are stored in the corresponding DVC-files (typically versioned with Git). +Only the hashes specified in DVC-files currently in the workspace are considered +by `dvc fetch` (unless the `-a` or `-T` options are used). ## Options diff --git a/content/docs/command-reference/get-url.md b/content/docs/command-reference/get-url.md index 2ea7ba30ad..8f5730cf04 100644 --- a/content/docs/command-reference/get-url.md +++ b/content/docs/command-reference/get-url.md @@ -4,7 +4,7 @@ Download a file or directory from a supported URL (for example `s3://`, `ssh://`, and other protocols) into the local file system. > See `dvc get` to download data/model files or directories from other DVC -> repositories (e.g. hosted on GitHub). +> repositories (e.g. hosted on GitHub).
## Synopsis diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index ce3cd491c2..9b636773da 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -148,8 +148,8 @@ https://remote.dvc.org/get-started/66/2eb7f64216d9c2c1088d0a5e2c6951 location to place the target data within the workspace. Combining these two options allows us to do something we can't achieve with the regular `git checkout` + `dvc checkout` process – see for example the -[Get Older Data Version](/doc/tutorials/get-started/older-versions) chapter of -our _Get Started_. +[Get Older Data Version](/doc/tutorials/get-started/data-versioning#navigate-versions) +chapter of our _Get Started_. Let's use the [get started example repo](https://github.com/iterative/example-get-started) @@ -161,12 +161,13 @@ $ git clone https://github.com/iterative/example-get-started $ cd example-get-started ``` -If you are familiar with our [Get Started](/doc/tutorials/get-started) project -(used in these examples), you may remember that the chapter where we train a -first version of the model corresponds to the the `baseline-experiment` tag in -the repo. Similarly `bigrams-experiment` points to an improved model (trained -using bigrams). What if we wanted to have both versions of the model "checked -out" at the same time? `dvc get` provides an easy way to do this: +If you are familiar with the project in our +[Get Started](/doc/tutorials/get-started) (used in these examples), you may +remember that the chapter where we train a first version of the model +corresponds to the `baseline-experiment` tag in the repo. Similarly +`bigrams-experiment` points to an improved model (trained using bigrams). What +if we wanted to have both versions of the model "checked out" at the same time? +`dvc get` provides an easy way to do this: ```dvc $ dvc get .
model.pkl --rev baseline-experiment diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 6787def8c9..ab5894f4d0 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -5,7 +5,7 @@ Download a file or directory from a supported URL (for example `s3://`, changes in the remote data source. Creates a DVC-file. > See `dvc import` to download and tack data/model files or directories from -> other DVC repositories (e.g. hosted on GitHub). +> other DVC repositories (e.g. hosted on GitHub). ## Synopsis @@ -136,8 +136,8 @@ in the [Get Started](/doc/tutorials/get-started). Start by cloning our example repo if you don't already have it. Then move into the repo and checkout the [2-remote](https://github.com/iterative/example-get-started/releases/tag/2-remote) -tag, corresponding to the [Configure](/doc/tutorials/get-started/configure) _Get -Started_ chapter: +tag, corresponding to the [Configure](/doc/tutorials/get-started#configure) +section of the _Get Started_: ```dvc $ git clone https://github.com/iterative/example-get-started @@ -146,15 +146,16 @@ $ git checkout 2-remote $ mkdir data ``` -You should now have a blank workspace, just before the -[Add Files](/doc/tutorials/get-started/add-files) chapter. +You should now have a blank workspace, just before +[Versioning Basics](/doc/tutorials/get-started/data-versioning). ## Example: Tracking a remote file -An advanced alternate to [Add Files](/doc/tutorials/get-started/add-files) -chapter of the _Get Started_ is to use `dvc import-url`: +An advanced alternative to the intro of the +[Versioning Basics](/doc/tutorials/get-started/data-versioning) part of the _Get +Started_ is to use `dvc import-url`: ```dvc $ dvc import-url https://data.dvc.org/get-started/data.xml \ @@ -246,9 +247,9 @@ directory we created previously. (Its `path` has the URL for the data store.) And instead of an `etag` we have an `md5` hash value.
We did this so its easy to edit the data file. -Let's now manually reproduce a -[processing chapter](/doc/tutorials/get-started/connect-code-and-data) from the -_Get Started_ project. Download the example source code archive and unzip it: +Let's now manually reproduce the +[data processing part](/doc/tutorials/get-started/data-pipelines) of the _Get +Started_ project. Download the example source code archive and unzip it: ```dvc $ wget https://code.dvc.org/get-started/code.zip diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index d30cb46253..f5d63979f5 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -181,7 +181,7 @@ $ dvc update --rev cats-dogs-v2 ## Example: Data registry If you take a look at our -[dataset-registry](https://github.com/iterative/dataset-registry) +[dataset registry](https://github.com/iterative/dataset-registry) project, you'll see that it's organized into different directories such as `tutorial/ver` and `use-cases/`, and these contain [DVC-files](/doc/user-guide/dvc-file-format) that track different datasets. diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 3fc55b6a95..1d9d95f6f2 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -10,9 +10,9 @@ usage: dvc init [-h] [-q | -v] [--no-scm] [-f] [--subdir] ## Description -DVC works on top of a Git repository by default. This enables all features, -providing the most value. It means that `dvc init` (without flags) expects to -run in a Git repository root (a `.git/` directory should be present). +DVC works best in a Git repository. This enables all features, providing the +most value. For this reason, `dvc init` (without flags) expects to run in a Git +repository root (a `.git/` directory should be present). 
The command [options](#options) can be used to start an alternative workflow for advanced scenarios: diff --git a/content/docs/command-reference/metrics/show.md b/content/docs/command-reference/metrics/show.md index 5b6a2c6b50..398a5d4ee4 100644 --- a/content/docs/command-reference/metrics/show.md +++ b/content/docs/command-reference/metrics/show.md @@ -117,6 +117,7 @@ increase_bow: TP: 521 ``` -The [Compare Experiments](/doc/tutorials/get-started/compare-experiments) +The +[Compare Experiments](/doc/tutorials/get-started/experiments#compare-experiments) chapter of our _Get Started_ covers the `-a` option to collect and print a metric file value across all Git branches. diff --git a/content/docs/command-reference/pipeline/index.md b/content/docs/command-reference/pipeline/index.md index 6b4d6684e5..b9cf3eb4ae 100644 --- a/content/docs/command-reference/pipeline/index.md +++ b/content/docs/command-reference/pipeline/index.md @@ -1,6 +1,7 @@ # pipeline -A set of commands to manage [pipelines](/doc/tutorials/get-started/pipeline): +A set of commands to manage +[pipelines](/doc/tutorials/get-started/data-pipelines): [show](/doc/command-reference/pipeline/show) and [list](/doc/command-reference/pipeline/list). @@ -17,12 +18,13 @@ positional arguments: ## Description -A data pipeline, in general, is a series of data processes (for example console -commands that take an input and produce an output). A pipeline may -produce intermediate data, and has a final result. Machine Learning (ML) -pipelines typically start a with large raw datasets, include intermediate -featurization and training stages, and produce a final model, as well as -accuracy [metrics](/doc/command-reference/metrics). +A data pipeline, in general, is a series of data processing +[stages](/doc/command-reference/run) (for example console commands that take an +input and produce an output). A pipeline may produce intermediate +data, and has a final result. 
Machine learning (ML) pipelines typically start +with large raw datasets, include intermediate featurization and training stages, +and produce a final model, as well as accuracy +[metrics](/doc/command-reference/metrics). In DVC, pipeline stages and commands, their data I/O, interdependencies, and results (intermediate or final) are specified with `dvc add` and `dvc run`, diff --git a/content/docs/command-reference/remote/add.md b/content/docs/command-reference/remote/add.md index 572154eddb..04d9cce860 100644 --- a/content/docs/command-reference/remote/add.md +++ b/content/docs/command-reference/remote/add.md @@ -349,10 +349,9 @@ $ dvc remote add -d myremote https://example.com/path/to/dir A "local remote" is a directory in the machine's file system. > While the term may seem contradictory, it doesn't have to be. The "local" part -> refers to the machine where the project is stored, so it can be -> any directory accessible to the same system. The "remote" part refers -> specifically to the project/repository itself. Read "local, but external" -> storage. +> refers to the type of location where the storage is: another directory in the +> same file system. "Remote" is what we call storage for DVC +> projects. It's essentially a local backup for data tracked by DVC. Using an absolute path (recommended): diff --git a/content/docs/command-reference/remote/index.md b/content/docs/command-reference/remote/index.md index 09a006936b..e7b29a58e2 100644 --- a/content/docs/command-reference/remote/index.md +++ b/content/docs/command-reference/remote/index.md @@ -26,7 +26,7 @@ positional arguments: What is data remote? The same way as GitHub provides storage hosting for Git repositories, DVC remotes provide a central place to keep and share data and model files.
With this remote storage, you can pull models and data files created by colleagues without spending time and resources to build or process them locally. It also @@ -76,9 +76,9 @@ For the typical process to share the project via remote, see ### What is a "local remote" ? While the term may seem contradictory, it doesn't have to be. The "local" part -refers to the machine where the project is stored, so it can be any -directory accessible to the same system. The "remote" part refers specifically -to the project/repository itself. Read "local, but external" storage. +refers to the type of location where the storage is: another directory in the +same file system. "Remote" is what we call storage for DVC projects. +It's essentially a local backup for data tracked by DVC. diff --git a/content/docs/command-reference/remote/list.md b/content/docs/command-reference/remote/list.md index 78aa5068cd..45f3e609fd 100644 --- a/content/docs/command-reference/remote/list.md +++ b/content/docs/command-reference/remote/list.md @@ -33,16 +33,16 @@ including names and URLs. ## Examples -Let's for simplicity add a _default_ local remote: +For simplicity, let's add a default local remote:
### What is a "local remote" ? While the term may seem contradictory, it doesn't have to be. The "local" part -refers to the machine where the project is stored, so it can be any -directory accessible to the same system. The "remote" part refers specifically -to the project/repository itself. Read "local, but external" storage. +refers to the type of location where the storage is: another directory in the +same file system. "Remote" is what we call storage for DVC projects. +It's essentially a local backup for data tracked by DVC.
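To make the "local remote" concept above concrete, here is a sketch of the `.dvc/config` entry that `dvc remote add -d` would produce. The remote name `myremote` and the path `/tmp/dvc-storage` are illustrative assumptions, not values from this patch:

```ini
# Hypothetical .dvc/config after running:
#   dvc remote add -d myremote /tmp/dvc-storage
['remote "myremote"']
    url = /tmp/dvc-storage
[core]
    remote = myremote
```

The `-d` flag is what writes the `core.remote` entry, so commands like `dvc push` and `dvc pull` use this directory by default, with no `--remote` option needed.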
diff --git a/content/docs/command-reference/remote/modify.md b/content/docs/command-reference/remote/modify.md index c5aba0713f..22964eee3b 100644 --- a/content/docs/command-reference/remote/modify.md +++ b/content/docs/command-reference/remote/modify.md @@ -178,10 +178,10 @@ these settings, you could use the following options: $ dvc remote modify myremote grant_full_control id=aws-canonical-user-id,id=another-aws-canonical-user-id ``` - > \* - `grant_read`, `grant_read_acp`, `grant_write_acp` and + > \* `grant_read`, `grant_read_acp`, `grant_write_acp` and > `grant_full_control` params are mutually exclusive with `acl`. > - > \*\* - default ACL grantees are overwritten. Grantees are AWS accounts + > \*\* default ACL grantees are overwritten. Grantees are AWS accounts > identifiable by `id` (AWS Canonical User ID), `emailAddress` or `uri` > (predefined group). diff --git a/content/docs/index.md b/content/docs/index.md index aedc913bd0..ab4da65287 100644 --- a/content/docs/index.md +++ b/content/docs/index.md @@ -1,34 +1,26 @@ # DVC Documentation -Welcome! In here you may find all the guiding material and technical documents -needed to learn about DVC: how to use it, how it works, and where to go for -additional resources. +Data Version Control, or DVC, is a data and ML experiments management tool that +takes advantage of the existing engineering toolset that you're already familiar +with (Git, CI/CD, etc.) - + + A step-by-step introduction into basic DVC features + -A step-by-step introduction into basic DVC features + + Study the detailed inner-workings of DVC in its user guide. + - + + Non-exhaustive list of scenarios DVC can help with + - - -Study the detailed inner-workings of DVC in its user guide. - - - - - -Non-exhaustive list of scenarios DVC can help with - - - - - -See all of DVC's commands. - - + + See all of DVC's commands. 
+ diff --git a/content/docs/install/macos.md b/content/docs/install/macos.md index 0092b4a1e5..5464db3f87 100644 --- a/content/docs/install/macos.md +++ b/content/docs/install/macos.md @@ -14,7 +14,7 @@ $ brew install dvc ## Install from package Get the PKG (binary) from the big "Download" button on the [home page](/), or -from the [release page](https://github.com/iterative/dvc/releases/) on GitHub. +from the [release page](https://github.com/iterative/dvc/releases/) on GitHub. > Note that currently, in order to open the PKG file, you must go to the > Downloads directory in Finder and use diff --git a/content/docs/install/pre-release.md b/content/docs/install/pre-release.md index 9e4c378dc9..bf08ec7034 100644 --- a/content/docs/install/pre-release.md +++ b/content/docs/install/pre-release.md @@ -1,7 +1,7 @@ # Install Pre-release Version If you want to test the latest stable version of DVC, ahead of official -releases, you can install it from our code repository GitHub. +releases, you can install it from our code repository on GitHub.
> We **strongly** recommend creating a > [virtual environment](https://packaging.python.org/tutorials/installing-packages/#creating-virtual-environments) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 752500a84e..62f5071126 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -32,32 +32,15 @@ "children": [ { "slug": "get-started", - "source": false, + "source": "get-started/index.md", "tutorials": { "katacoda": "https://katacoda.com/dvc/courses/get-started/initialize" }, "children": [ - "agenda", - "initialize", - "configure", - "add-files", - "store-data", - "retrieve-data", - "import-data", - { - "label": "Connect with Code", - "slug": "connect-code-and-data" - }, - "pipeline", - "visualize", - "reproduce", - "metrics", - "experiments", - "compare-experiments", - { - "label": "Get Older Files", - "slug": "older-versions" - } + "data-versioning", + "data-pipelines", + "data-access", + "experiments" ] }, { diff --git a/content/docs/tutorials/deep/define-ml-pipeline.md b/content/docs/tutorials/deep/define-ml-pipeline.md index 9db5a5ee93..bb4fde619f 100644 --- a/content/docs/tutorials/deep/define-ml-pipeline.md +++ b/content/docs/tutorials/deep/define-ml-pipeline.md @@ -13,21 +13,6 @@ $ du -sh data/* 41M data/Posts.xml.zip ``` -
- -### Expand to learn how to download on Windows - -Windows doesn't include the `wget` utility by default, but you can use the -browser to download `data.xml`. (Right-click -[this link](https://data.dvc.org/tutorial/ver/data.zip) and select -`Save Link As...` (Chrome). Save it into the `data/` subdirectory. - -> Please also review -> [Running DVC on Windows](/doc/user-guide/running-dvc-on-windows) for important -> tips to improve your experience using DVC on Windows. - -
- At this time, `data/Posts.xml.zip` is a regular (untracked) file. We can track it with DVC using `dvc add` (see below). After executing the command you will see a new file `data/Posts.xml.zip.dvc` and a change in `data/.gitignore`. Both diff --git a/content/docs/tutorials/deep/preparation.md b/content/docs/tutorials/deep/preparation.md index 7a79ab483a..1983c9099b 100644 --- a/content/docs/tutorials/deep/preparation.md +++ b/content/docs/tutorials/deep/preparation.md @@ -11,7 +11,7 @@ modifying some of the code during this tutorial to improve the model. > We have tested our tutorials and examples with Python 3. We don't recommend > using earlier versions. -You'll need [Git](https://git-scm.com) to run the commands in this tutorial. +You'll need [Git](https://git-scm.com/) to run the commands in this tutorial. Also, if DVC is not installed, please follow these [instructions](/doc/install) to do so. @@ -33,6 +33,10 @@ browser to download `code.zip`. (Right-click [this link](https://code.dvc.org/tutorial/nlp/code.zip) and select `Save Link As...` (Chrome). Save it into the project directory. +> 💡 Please also review +> [Running DVC on Windows](/doc/user-guide/running-dvc-on-windows) for important +> tips to improve your experience using DVC on Windows. + ```dvc diff --git a/content/docs/tutorials/deep/sharing-data.md b/content/docs/tutorials/deep/sharing-data.md index 8591107b36..199dbe59f7 100644 --- a/content/docs/tutorials/deep/sharing-data.md +++ b/content/docs/tutorials/deep/sharing-data.md @@ -22,7 +22,7 @@ can be done using the CLI as shown below. > have write access to it, so in order to follow the tutorial you will need to > either create your own S3 bucket or use other types of > [remote storage](/doc/command-reference/remote). E.g. you can set up a local -> remote as we did in the [Configure](/doc/tutorials/get-started/configure) +> remote as we did in the [Configure](/doc/tutorials/get-started#configure) > chapter of _Get Started_.
```dvc diff --git a/content/docs/tutorials/get-started/add-files.md b/content/docs/tutorials/get-started/add-files.md deleted file mode 100644 index 438ef9495d..0000000000 --- a/content/docs/tutorials/get-started/add-files.md +++ /dev/null @@ -1,89 +0,0 @@ -# Add Files or Directories - -DVC allows storing and versioning data files, ML models, directories, -intermediate results with Git, without tracking the file contents with Git. -Let's get a dataset example to play with: - -```dvc -$ mkdir data -$ dvc get https://github.com/iterative/dataset-registry \ - get-started/data.xml -o data/data.xml -``` - -> `dvc get` can use any DVC repository to find the appropriate -> [remote storage](/doc/command-reference/remote) and download data -> artifacts from it (analogous to `wget`, but for repositories). In this -> case we use [dataset-registry](https://github.com/iterative/dataset-registry)) -> as the source repo. (Refer to -> [Data Registries](/doc/use-cases/data-registries) for more info about this -> setup.) - -To track a file (or a directory) with DVC just run `dvc add` on it. For example: - -```dvc -$ dvc add data/data.xml -``` - -DVC stores information about the added data in a special file called a -**DVC-file**. DVC-files are small text files with a human-readable -[format](/doc/user-guide/dvc-file-format) and they can be committed with Git: - -```dvc -$ git add data/.gitignore data/data.xml.dvc -$ git commit -m "Add raw data to project" -``` - -Committing DVC-files with Git allows us to track different versions of the -project data as it evolves with the source code tracked by Git. - -
- -### Expand to learn about DVC internals - -`dvc add` moves the actual data file to the cache directory (see -[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories)), while -the entries in the workspace may be file links to the actual files in the DVC -cache. - -```dvc -$ ls -R .dvc/cache - .dvc/cache/a3: - 04afb96060aad90176268345e10355 -``` - -`a304afb96060aad90176268345e10355` above is the hash value of the `data.xml` -file we just added with DVC. If you check the `data/data.xml.dvc` DVC-file, you -will see that it has this string inside. - -### Important note on cache performance - -DVC tries to use reflinks\* by default to link your data files from the DVC -cache to the workspace, optimizing speed and storage space. However, reflinks -are not widely supported yet and DVC falls back to actually copying data files -to/from the cache. **Copying can be very slow with large files**, and duplicates -storage requirements. - -Hardlinks and symlinks are also available for optimized cache linking but, -(unlike reflinks) they carry the risk of accidentally corrupting the cache if -tracked data files are modified in the workspace. - -See [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization) and -`dvc config cache` for more information. - -> \***copy-on-write links or "reflinks"** are a relatively new way to link files -> in UNIX-style file systems. Unlike hardlinks or symlinks, they support -> transparent [copy on write](https://en.wikipedia.org/wiki/Copy-on-write). This -> means that editing a reflinked file is always safe as all the other links to -> the file will reflect the changes. - -
- -If your workspace uses Git, without DVC you would have to manually put each data -file or directory into `.gitignore`. DVC commands that track data files -automatically takes care of this for you! (You just have to add the changes with -Git.) - -Refer to -[Versioning Data and Model Files](/doc/use-cases/versioning-data-and-model-files), -`dvc add`, and `dvc run` for more information on storing and versioning data -files with DVC. diff --git a/content/docs/tutorials/get-started/agenda.md b/content/docs/tutorials/get-started/agenda.md deleted file mode 100644 index ca56dfbdec..0000000000 --- a/content/docs/tutorials/get-started/agenda.md +++ /dev/null @@ -1,39 +0,0 @@ -# Agenda - -You'll need [Git](https://git-scm.com) to run the commands in this guide. Also, -if DVC is not installed, please follow these [instructions](/doc/install) to do -so. - -In the next few sections we'll build a simple natural language processing (NLP) -project from scratch. If you'd like to get the final result or have any issues -along the way, you can download the fully reproducible -[GitHub project](https://github.com/iterative/example-get-started) by running: - -```dvc -$ git clone https://github.com/iterative/example-get-started -``` - -Otherwise, bear with us and we'll introduce some basic DVC concepts to get the -same results together! - -The idea for this project is a simplified version of our -[Deep Dive Tutorial](/doc/tutorials/deep). It explores the NLP problem of -predicting tags for a given StackOverflow question. For example, we might want a -classifier that can classify (or predict) posts about Python by tagging them -with `python`. - -![](/img/example-flow-2x.png) - -This is a natural language processing context, but NLP isn't the only area of -data science where DVC can help. DVC is designed to be agnostic of frameworks, -languages, etc. 
If you have data files or datasets and/or you produce data -files, models, or datasets and you want to: - -- Capture and save those data artifacts the same way you capture - code -- Track and switch between different versions of data easily -- Understand how data artifacts (e.g. ML models) were built in the first place -- Be able to compare models to each other -- Bring software best practices to your team and get everyone on the same page - -Then you're in the right place! Click the `Next` button below to start โ†˜ diff --git a/content/docs/tutorials/get-started/compare-experiments.md b/content/docs/tutorials/get-started/compare-experiments.md deleted file mode 100644 index 21e09a8450..0000000000 --- a/content/docs/tutorials/get-started/compare-experiments.md +++ /dev/null @@ -1,42 +0,0 @@ -# Compare Experiments - -DVC makes it easy to iterate on your project using Git commits with tags or Git -branches. It provides a way to try different ideas, keep track of them, switch -back and forth. To find the best performing experiment or track the progress, -[project metrics](/doc/command-reference/metrics) are supported in DVC (as -described in one of the previous chapters). - -Let's run evaluate for the latest `bigrams` experiment we created in previous -chapters. It mostly takes just running the `dvc repro`: - -```dvc -$ git checkout master -$ dvc checkout -$ dvc repro evaluate.dvc -``` - -`git checkout master` and `dvc checkout` commands ensure that we have the latest -experiment code and data respectively. And `dvc repro`, as we discussed in the -[Reproduce](/doc/tutorials/get-started/reproduce) chapter, is a way to run all -the necessary commands to build the model and measure its performance. 
- -```dvc -$ git commit -am "Evaluate bigrams model" -$ git tag -a "bigrams-experiment" -m "Bigrams experiment evaluation" -``` - -Now, we can use `-T` option of the `dvc metrics show` command to see the -difference between the `baseline` and `bigrams` experiments: - -```dvc -$ dvc metrics show -T - -baseline-experiment: - auc.metric: 0.588426 -bigrams-experiment: - auc.metric: 0.602818 -``` - -DVC provides built-in support to track and navigate `JSON`, `TSV` or `CSV` -metric files if you want to track additional information. See `dvc metrics` to -learn more. diff --git a/content/docs/tutorials/get-started/configure.md b/content/docs/tutorials/get-started/configure.md deleted file mode 100644 index cb3b429f50..0000000000 --- a/content/docs/tutorials/get-started/configure.md +++ /dev/null @@ -1,67 +0,0 @@ -# Configure - -Once you install DVC, you'll be able to start using it (in its local setup) -immediately. - -However, remote will be required (see `dvc remote`) if you need to share data or -models outside of the context of a single project, for example with other -collaborators or even with yourself in a different computing environment. It's -similar to the way you would use GitHub or any other Git server to store and -share your code. - -For simplicity, let's setup a local remote: - -
- -### What is a "local remote" ? - -While the term may seem contradictory, it doesn't have to be. The "local" part -refers to the machine where the project is stored, so it can be any -directory accessible to the same system. The "remote" part refers specifically -to the project/repository itself. Read "local, but external" storage. - -
- -```dvc -$ dvc remote add -d myremote /tmp/dvc-storage -$ git commit .dvc/config -m "Configure local remote" -``` - -> We only use a local remote in this section for simplicity's sake as you learn -> to use DVC. For most [use cases](/doc/use-cases), other "more remote" types of -> remotes will be required. - -[Adding a remote](/doc/command-reference/remote/add) should be specified by both -its type (protocol) and its path. DVC currently supports these types of remotes: - -- `s3`: Amazon Simple Storage Service -- `azure`: Microsoft Azure Blob Storage -- `gdrive` : Google Drive -- `gs`: Google Cloud Storage -- `ssh`: Secure Shell (requires SFTP) -- `hdfs`: Hadoop Distributed File System -- `http`: HTTP and HTTPS protocols -- `local`: Directory in the local file system - -> If you installed DVC via `pip` and plan to use cloud services as remote -> storage, you might need to install these optional dependencies: `[s3]`, -> `[azure]`, `[gdrive]`, `[gs]`, `[oss]`, `[ssh]`. Alternatively, use `[all]` to -> include them all. The command should look like this: `pip install "dvc[s3]"`. -> (This example installs `boto3` library along with DVC to support S3 storage.) - -For example, to setup an S3 remote we would use something like this (make sure -that `mybucket` exists): - -```dvc -$ dvc remote add -d s3remote s3://mybucket/myproject -``` - -> This command is only shown for informational purposes. No need to actually run -> it in order to continue with the Get Started. - -You can see that DVC doesn't require installing any databases, servers, or -warehouses. It can use bare S3 or SSH to store data, intermediate results, and -models. - -See `dvc config` to get information about more configuration options and -`dvc remote` to learn more about remotes and get more examples. 
diff --git a/content/docs/tutorials/get-started/connect-code-and-data.md b/content/docs/tutorials/get-started/connect-code-and-data.md deleted file mode 100644 index 1bec301c0c..0000000000 --- a/content/docs/tutorials/get-started/connect-code-and-data.md +++ /dev/null @@ -1,165 +0,0 @@ -# Connect Code and Data - -Even in its basic scenarios, commands like `dvc add`, `dvc push`, `dvc pull` -described in the previous sections could be used independently and provide a -basic useful framework to track, save and share models and large data files. To -achieve full reproducibility though, we'll have to connect code and -configuration with the data it processes to produce the result. - -
-
-### Expand to prepare example code
-
-If you've followed this _Get Started_ section from the beginning, run these
-commands to get the example code:
-
-```dvc
-$ wget https://code.dvc.org/get-started/code.zip
-$ unzip code.zip
-$ rm -f code.zip
-```
-
-Windows doesn't include the `wget` utility by default, but you can use the
-browser to download `code.zip`. (Right-click
-[this link](https://code.dvc.org/get-started/code.zip) and select
-`Save Link As...` in Chrome.) Save it into the project directory.
-
-The workspace should now look like this:
-
-```dvc
-$ tree
-.
-├── data
-│   ├── data.xml
-│   └── data.xml.dvc
-└── src
-    ├── evaluate.py
-    ├── featurization.py
-    ├── prepare.py
-    ├── requirements.txt
-    └── train.py
-```
-
-Now let's install the requirements. But before we do that, we **strongly**
-recommend creating a
-[virtual environment](https://packaging.python.org/tutorials/installing-packages/#creating-virtual-environments):
-
-```dvc
-$ virtualenv -p python3 .env
-$ echo ".env/" >> .gitignore
-$ source .env/bin/activate
-$ pip install -r src/requirements.txt
-```
-
-Optionally, save the progress with Git:
-
-```dvc
-$ git add .
-$ git commit -m "Add source code files to repo"
-```
-
-</details>
- -Having installed the `src/prepare.py` script in your repo, the following command -transforms it into a reproducible [stage](/doc/command-reference/run) for the ML -pipeline we're building (described in the -[next chapter](/doc/tutorials/pipelines)). - -```dvc -$ dvc run -f prepare.dvc \ - -d src/prepare.py -d data/data.xml \ - -o data/prepared \ - python src/prepare.py data/data.xml -``` - -`dvc run` generates the `prepare.dvc` DVC-file. It has the same -[format](/doc/user-guide/dvc-file-format) as the file we created in the -[previous section](/doc/tutorials/get-started/add-files) to track `data.xml`, -except in this case it has additional information about the `data/prepared` -output (a directory where two files, `train.tsv` and `test.tsv`, will be written -to), and about the Python command that is required to build it. - -
-
-### Expand to learn more about what has just happened
-
-This is how the result should look now:
-
-```diff
- .
- ├── data
- │   ├── data.xml
- │   ├── data.xml.dvc
-+ │   └── prepared
-+ │       ├── test.tsv
-+ │       └── train.tsv
-+ ├── prepare.dvc
- └── src
-     ├── evaluate.py
-     ├── featurization.py
-     ├── prepare.py
-     ├── requirements.txt
-     └── train.py
-```
-
-This is how `prepare.dvc` looks:
-
-```yaml
-cmd: python src/prepare.py data/data.xml
-deps:
-  - md5: b4801c88a83f3bf5024c19a942993a48
-    path: src/prepare.py
-  - md5: a304afb96060aad90176268345e10355
-    path: data/data.xml
-md5: c3a73109be6c186b9d72e714bcedaddb
-outs:
-  - cache: true
-    md5: 6836f797f3924fb46fcfd6b9f6aa6416.dir
-    metric: false
-    path: data/prepared
-wdir: .
-```
-
-> `dvc run` is just the first of a set of DVC commands required to generate a
-> [pipeline](/doc/tutorials/get-started/pipeline), or in other words,
-> instructions on how to build an ML model (data file) from previous data files
-> (or directories).
-
-Let's briefly mention what the command options used above mean for this
-particular example:
-
-`-f prepare.dvc` specifies a name for the DVC-file (pipeline stage). It's
-optional but we recommend using it to make your project structure more readable.
-
-`-d src/prepare.py` and `-d data/data.xml` mean that the `prepare.dvc` stage
-file depends on them to produce the result. When you run `dvc repro` next time
-(see next chapter) DVC will automatically check these dependencies and decide
-whether this stage is up to date or whether it should be executed to regenerate
-its outputs.
-
-`-o data/prepared` specifies the output directory processed data will be put
-into. The script creates two files in it that will be used later to generate
-features, train and evaluate the model.
-
-And, the last line, `python src/prepare.py data/data.xml`, specifies a command
-to run.
This command is saved to the generated DVC-file, and used later by
-`dvc repro`.
-
-Hopefully, `dvc run` (and `dvc repro`) will become intuitive after a few more
-Get Started chapters. You can always refer to the command references for
-more details on their behavior and options.
-
-</details>
-
-You don't need to run `dvc add` to track output files (`prepared/train.tsv` and
-`prepared/test.tsv`) with DVC. `dvc run` takes care of this. You only need to
-run `dvc push` (usually along with `git commit`) to save them to the remote when
-you are done.
-
-Let's commit the changes to save the stage we built:
-
-```dvc
-$ git add data/.gitignore prepare.dvc
-$ git commit -m "Create data preparation stage"
-$ dvc push
-```
diff --git a/content/docs/tutorials/get-started/data-access.md b/content/docs/tutorials/get-started/data-access.md
new file mode 100644
index 0000000000..9416e69a1c
--- /dev/null
+++ b/content/docs/tutorials/get-started/data-access.md
@@ -0,0 +1,101 @@
+# Data Access
+
+We've seen how to [version data](/doc/tutorials/get-started/data-versioning) for
+sharing among team members or environments of the same project. But what about
+reusing your data and models from an existing DVC project in other projects, or
+on a production deployment?
+
+DVC repositories serve as an entry point for your data. The
+following CLI commands and an API are available to access any version of it,
+from any machine where DVC is installed.
+
+## Find a dataset
+
+You can use `dvc list` to explore a DVC repository hosted on any
+Git server. For example, let's see what's in the `use-cases/` directory of our
+[dataset-registry](https://github.com/iterative/dataset-registry) repo:
+
+```dvc
+$ dvc list https://github.com/iterative/dataset-registry use-cases
+.gitignore
+cats-dogs
+cats-dogs.dvc
+```
+
+The benefit of this command over browsing a Git hosting website is that the list
+includes files and directories tracked by **both Git and DVC**.
+
+## Just download it
+
+One way is to simply download the data with `dvc get`.
This is useful when
+working outside of a DVC project environment, for example in an
+automated ML model deployment task:
+
+```dvc
+$ dvc get https://github.com/iterative/dataset-registry \
+          use-cases/cats-dogs
+```
+
+When working inside another DVC project though, this is not the best strategy
+because the connection between the projects is lost — others won't know where
+the data came from or whether new versions are available.
+
+## Import the dataset
+
+> Requires an [initialized](/doc/tutorials/get-started#initialize) DVC
+> project.
+
+`dvc import` downloads a dataset, while also tracking it **in the same step**:
+
+```dvc
+$ dvc import https://github.com/iterative/dataset-registry \
+             use-cases/cats-dogs
+```
+
+This is similar to `dvc get`+`dvc add`, but the resulting
+[DVC-file](/doc/user-guide/dvc-file-format) includes metadata to track changes
+in the source repository. This allows you to bring in changes from the data
+source later, using `dvc update`.
+
+<details>
+ +#### Expand to see what happened internally + +> Note that the +> [dataset registry](https://github.com/iterative/dataset-registry) repository +> doesn't actually contain a `cats-dogs/` directory. Like `dvc get`, +> `dvc import` downloads from [remote storage](/doc/command-reference/remote). + +DVC-files created by `dvc import` are called _import stages_. These have special +fields, such as the data source `repo`, and `path` (under `deps`): + +```yaml +deps: + path: use-cases/cats-dogs + repo: + url: https://github.com/iterative/dataset-registry + rev_lock: f31f5c4cdae787b4bdeb97a717687d44667d9e62 +``` + +The `url` and `rev_lock` subfields under `repo` are used to save the origin and +[version](https://git-scm.com/docs/revisions) of the dependency, respectively. + +
+
+## Python API
+
+It's also possible to integrate your data or models directly in source code with
+DVC's _Python API_. This lets you **access the data contents directly** from
+within an application at runtime. For example:
+
+```py
+import dvc.api
+
+with dvc.api.open(
+    'use-cases/cats-dogs',
+    repo='https://github.com/iterative/dataset-registry'
+    ) as fd:
+    ...  # fd is a file-like object that can be processed normally
+```
+
+📖 Please refer to the [DVC Python API](/doc/api-reference) for more details.
diff --git a/content/docs/tutorials/get-started/data-pipelines.md b/content/docs/tutorials/get-started/data-pipelines.md
new file mode 100644
index 0000000000..977b7978fd
--- /dev/null
+++ b/content/docs/tutorials/get-started/data-pipelines.md
@@ -0,0 +1,231 @@
+# Data Pipelines
+
+Versioning large data files and directories for data science is great, but not
+enough. How is data filtered, transformed, or used to train ML models? DVC
+introduces a mechanism to capture _data pipelines_ — **series of data
+processes** that produce a final result.
+
+DVC pipelines and their data can also be easily versioned (using Git). This
+allows you to better organize your project, and reproduce your workflow and
+results later exactly as they were built originally!
+
+<details>
+
+### 👉 Expand to prepare the project
+
+Get the sample project from GitHub with:
+
+```dvc
+$ git clone https://github.com/iterative/example-get-started
+$ cd example-get-started
+$ git checkout '4-update-data'
+$ dvc pull
+```
+
+</details>
+
+## Pipeline stages
+
+Use `dvc run` to create _stages_. These represent processes (source code tracked
+with Git) that form the **steps of a pipeline**. Stages also connect such code
+to its data input and output. Let's transform a Python script into a
+[stage](/doc/command-reference/run):
+
+<details>
+
+### 👉 Expand to download example code
+
+Get the sample code like this:
+
+```dvc
+$ wget https://code.dvc.org/get-started/code.zip
+$ unzip code.zip
+$ rm -f code.zip
+$ ls src
+cleanup.py  evaluate.py  featurization.py
+prepare.py  requirements.txt  train.py
+```
+
+Now let's install the requirements:
+
+> We **strongly** recommend creating a
+> [virtual environment](https://packaging.python.org/tutorials/installing-packages/#creating-virtual-environments)
+> first.
+
+```dvc
+$ pip install -r src/requirements.txt
+```
+
+Please also add or commit the source code directory with Git at this point.
+
+</details>
+
+```dvc
+$ dvc run -f prepare.dvc \
+          -d src/prepare.py -d data/data.xml \
+          -o data/prepared \
+          python src/prepare.py data/data.xml data/prepared
+```
+
+A `prepare.dvc` _stage file_ is generated with the same
+[format](/doc/user-guide/dvc-file-format) as the DVC-file we created previously
+to
+[track existing data](/doc/tutorials/get-started/data-versioning#tracking-changes).
+Additionally, it includes information about the command we ran
+(`python src/prepare.py`), its dependencies, and
+outputs.
+
+<details>
+
+### Expand to see what happened internally
+
+The command options used above mean the following:
+
+- `-f prepare.dvc` specifies a name for the stage file. It's optional but we
+  recommend using it to make your project structure more readable.
+
+- `-d src/prepare.py` and `-d data/data.xml` mean that the stage depends on
+  these files to work. Notice that the source code itself is marked as a
+  dependency. If any of these files change later, DVC will know that this stage
+  needs to be [reproduced](#reproduce).
+
+- `-o data/prepared` specifies an output directory for this script, which writes
+  two files in it. This is how the workspace should look now:
+
+  ```diff
+    .
+    ├── data
+    │   ├── data.xml
+    │   ├── data.xml.dvc
+  + │   └── prepared
+  + │       ├── test.tsv
+  + │       └── train.tsv
+  + ├── prepare.dvc
+    └── src
+        ├── ...
+  ```
+
+- The last line, `python src/prepare.py ...`, is the command to run in this
+  stage, and it's saved to the stage file, as shown below.
+
+The resulting stage file `prepare.dvc` contains all of the information above:
+
+```yaml
+cmd: python src/prepare.py data/data.xml data/prepared
+deps:
+  - md5: 1a18704abffac804adf2d5c4549f00f7
+    path: src/prepare.py
+  - md5: a304afb96060aad90176268345e10355
+    path: data/data.xml
+outs:
+  - md5: 6836f797f3924fb46fcfd6b9f6aa6416.dir
+    path: data/prepared
+    cache: true
+```
+
+</details>
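The `md5` values recorded under `deps` are what later let DVC decide whether this stage is still up to date. The sketch below illustrates the idea in plain Python; the helper names and the one-file example are invented for illustration, and real DVC also hashes directories and checks outputs:

```python
import hashlib
import os
import tempfile

def file_md5(path):
    """MD5 of a file's contents, like the hashes a stage file records."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def stage_is_up_to_date(recorded_deps):
    """True while every dependency still matches its recorded hash."""
    return all(file_md5(path) == md5 for path, md5 in recorded_deps.items())

# Tiny demo with a throwaway "dependency":
workdir = tempfile.mkdtemp()
dep = os.path.join(workdir, "data.xml")
with open(dep, "w") as f:
    f.write("original contents")

recorded = {dep: file_md5(dep)}           # what `dvc run` would store
assert stage_is_up_to_date(recorded)      # nothing changed yet

with open(dep, "w") as f:                 # now modify the dependency
    f.write("changed contents")
assert not stage_is_up_to_date(recorded)  # the stage must be reproduced
```

When the hashes diverge like this, reproducing the pipeline knows which stage outputs need to be regenerated.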
+
+### Tracking and versioning stages
+
+There's no need to use `dvc add` for DVC to track stage outputs (`data/prepared`
+in this case); `dvc run` already took care of this. You only need to run
+`dvc push` if you want to save them to
+[remote storage](/doc/tutorials/get-started/data-versioning#storing-and-sharing)
+(usually along with `git commit` to version the stage file itself).
+
+## Dependency graphs (DAGs)
+
+By using `dvc run` multiple times, and specifying outputs of a
+stage as dependencies of another one, we can describe a sequence of
+commands that gets to a desired result. This is what we call a _data pipeline_
+or [_dependency graph_](https://en.wikipedia.org/wiki/Directed_acyclic_graph).
+
+Let's create a second stage chained to the outputs of `prepare.dvc`, to perform
+feature extraction. And a third one for training a machine learning model, based
+on the features:
+
+```dvc
+$ dvc run -f featurize.dvc \
+          -d src/featurization.py -d data/prepared \
+          -o data/features \
+          python src/featurization.py data/prepared data/features
+
+$ dvc run -f train.dvc \
+          -d src/train.py -d data/features \
+          -o model.pkl \
+          python src/train.py data/features model.pkl
+```
+
+This would be a good point to commit the changes with Git. This includes any
+`.gitignore` files, and all the stage files that describe our pipeline so far.
+
+> 📖 See also the `dvc pipeline` command.
+
+## Reproduce
+
+Imagine you're just cloning the repository created so far, on
+another computer. It's extremely easy for anyone to reproduce the result
+end-to-end, by using `dvc repro`.
+
+<details>
+
+### 👉 Expand to simulate a fresh clone of this repo
+
+Move to another location in your file system and do this:
+
+```dvc
+$ git clone https://github.com/iterative/example-get-started
+$ cd example-get-started
+$ git checkout 7-train
+$ dvc unlock data/data.xml.dvc
+```
+
+</details>
+
+```dvc
+$ dvc repro train.dvc
+```
+
+`train.dvc` is used because it's the last stage file so far; it describes what
+code and data to use to regenerate a final result (ML model). From its
+dependencies, which may in turn be outputs of other stages, we can get the
+same info, and so on.
+
+`dvc repro` rebuilds this [dependency graph](#dependency-graphs-dags) and
+executes the necessary commands to rebuild all the pipeline
+artifacts.
+
+## Visualize
+
+Having built our pipeline, we need a good way to understand its structure.
+Seeing a graph of connected stage files would help. DVC lets you do just that,
+without leaving the terminal!
+
+```dvc
+$ dvc pipeline show --ascii train.dvc
+     +-------------------+
+     | data/data.xml.dvc |
+     +-------------------+
+               *
+               *
+               *
+        +-------------+
+        | prepare.dvc |
+        +-------------+
+               *
+               *
+               *
+      +---------------+
+      | featurize.dvc |
+      +---------------+
+               *
+               *
+               *
+        +-----------+
+        | train.dvc |
+        +-----------+
+```
+
+> We are using the `--ascii` option above to better illustrate this pipeline.
+> Please refer to `dvc pipeline show` to explore other options this command
+> supports.
diff --git a/content/docs/tutorials/get-started/data-versioning.md b/content/docs/tutorials/get-started/data-versioning.md
new file mode 100644
index 0000000000..650beb3787
--- /dev/null
+++ b/content/docs/tutorials/get-started/data-versioning.md
@@ -0,0 +1,260 @@
+# Data Versioning
+
+To **track** a large file or directory, place it in the workspace,
+and use `dvc add`:
+
+<details>
+
+### 👉 Expand to get an example dataset
+
+Having [initialized](/doc/tutorials/get-started#initialize) a project, do this:
+
+```dvc
+$ dvc get --rev cats-dogs-v1 \
+          https://github.com/iterative/dataset-registry \
+          use-cases/cats-dogs -o datadir
+```
+
+> `dvc get` can download any data artifact tracked in a DVC
+> repository. It's like `wget`, but for DVC or Git repos. In this case we
+> use a specific version (`cats-dogs-v1` tag) of our
+> [dataset registry](https://github.com/iterative/dataset-registry) repo as the
+> data source.
+
+Note that while the source data directory was called `cats-dogs/`, we are able
+to rename it locally to `datadir/`.
+
+</details>
+ +```dvc +$ dvc add datadir +``` + +DVC stores information about the added directory in a special _DVC-file_ named +`datadir.dvc`, a small text file with a human-readable +[format](/doc/user-guide/dvc-file-format). This file can be easily **versioned +like source code** with Git, as a placeholder for the original data (which is +listed in `.gitignore`): + +```dvc +$ git add .gitignore datadir.dvc +$ git commit -m "Add raw data" +``` + +
+ +### Expand to see what happened internally + +`dvc add` moved the data to the project's cache, and linked\* it +back to the workspace. + +```dvc +$ ls -R .dvc/cache +... +.dvc/cache/a3: +04afb96060aad90176268345e10355.dir +``` + +The hash value of the `datadir/` directory we just added (`a304afb...`) +determines the cache path shown above. And if you check `datadir.dvc`, you will +find it there too: + +```yaml +outs: + - md5: a304afb96060aad90176268345e10355 + path: datadir + cache: true +``` + +> \* See +> [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization) and +> `dvc config cache` for more information on file linking. + +
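The cache layout seen above follows a simple convention: the first two characters of the hash become a subdirectory, and the rest become the file name (hashes of directories additionally carry a `.dir` suffix). A quick sketch of that mapping as plain string handling, not DVC's actual code:

```python
import os

def cache_path(md5_hash, cache_dir=".dvc/cache"):
    # Shard the cache by the first two hex characters of the hash.
    # Real DVC appends ".dir" when the hash refers to a directory.
    return os.path.join(cache_dir, md5_hash[:2], md5_hash[2:])

# The hash recorded in datadir.dvc:
print(cache_path("a304afb96060aad90176268345e10355"))
# .dvc/cache/a3/04afb96060aad90176268345e10355  (on Linux/macOS)
```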
+ +## Tracking changes + +`dvc status` can notice when tracked data has changed (among other situations). +To record a new version of the data, just use `dvc add` again: + +
+
+### 👉 Expand to get an updated dataset
+
+```dvc
+$ dvc get --rev cats-dogs-v2 \
+          https://github.com/iterative/dataset-registry \
+          use-cases/cats-dogs -o datadir
+```
+
+</details>
+ +```dvc +$ dvc status +datadir.dvc: + changed outs: + modified: datadir +$ dvc add datadir +``` + +DVC caches the changes to the `datadir/` directory, and updates the +`datadir.dvc` [DVC-file](/doc/user-guide/dvc-file-format) to match the changes. +Let's commit this new version with Git: + +
+ +### Expand to see what happened internally + +Use `git diff` to show the change in `datadir.dvc`: + +```diff + outs: +-- md5: a304afb96060aad90176268345e10355 ++- md5: 558a00881d4a6815ba625c13e27c5b7e + path: datadir + cache: true +``` + +Since the contents of `datadir/` changed, its hash value is updated (to +`558a008...`). + +
+ +```dvc +$ git add datadir.dvc +$ git commit -m "Change data" +``` + +## Switching versions + +When we have more than one data version, we may want to switch between them. We +can use `dvc checkout` for this. Let's say we want to revert back to the first +`datadir/`: + +```dvc +$ git checkout HEAD^ datadir.dvc +$ dvc checkout datadir.dvc +``` + +
+
+### Expand to see what happened internally
+
+`git checkout` brought the `datadir.dvc` DVC-file back to its previous version,
+with the previous hash value of the data (`a304afb...`):
+
+```yaml
+outs:
+  - md5: a304afb96060aad90176268345e10355
+    path: datadir
+```
+
+All `dvc checkout` does is put the corresponding files, stored in the
+cache, back into the workspace. This brings
+DVC-tracked data up to date with the current Git commit.
+
+</details>
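Conceptually, then, a checkout is a transfer from cache to workspace, keyed by the hash in the DVC-file. Here is a toy sketch of that idea using a plain copy and throwaway directories (as noted above, real DVC prefers file links and also handles whole directories):

```python
import os
import shutil
import tempfile

def checkout(md5_hash, target, cache_dir):
    """Place the cached file for `md5_hash` back into the workspace."""
    cached = os.path.join(cache_dir, md5_hash[:2], md5_hash[2:])
    shutil.copyfile(cached, target)  # real DVC uses links when possible

# Fake a single-entry cache, then "check it out":
cache = tempfile.mkdtemp()
workspace = tempfile.mkdtemp()
h = "a304afb96060aad90176268345e10355"
os.makedirs(os.path.join(cache, h[:2]))
with open(os.path.join(cache, h[:2], h[2:]), "w") as f:
    f.write("version 1 of the data")

checkout(h, os.path.join(workspace, "datadir"), cache)
with open(os.path.join(workspace, "datadir")) as f:
    print(f.read())  # version 1 of the data
```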
+
+> Note that you can use `dvc install` to set up Git hooks that automate common
+> actions, like checking out DVC-tracked data after every Git checkout.
+
+## Storing and sharing
+
+You can **upload** DVC-tracked data or models with `dvc push`, so they're safely
+stored [remotely](/doc/command-reference/remote). This also means they can be
+retrieved on other environments later.
+
+<details>
+
+### 👉 Set up remote storage first
+
+DVC remotes let you store a copy of the data tracked by DVC outside of the local
+cache, usually a **cloud storage** service. For simplicity, let's set up a
+_local remote_:
+
+```dvc
+$ mkdir -p /tmp/dvc-storage
+$ dvc remote add -d myremote /tmp/dvc-storage
+$ git commit .dvc/config -m "Configure local remote"
+```
+
+> While the term "local remote" may seem contradictory, it doesn't have to be.
+> The "local" part refers to the type of location: another directory in the file
+> system. "Remote" is what we call storage for DVC projects. It's
+> essentially a local data backup.
+
+💡 DVC supports the following **remote storage types**: Google Drive, Amazon S3,
+Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP.
+Please refer to `dvc remote add` for more details and examples.
+
+</details>
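For reference, after the `dvc remote add -d` command above, the committed `.dvc/config` should contain something like the following (the exact layout can vary slightly between DVC versions):

```ini
[core]
remote = myremote
['remote "myremote"']
url = /tmp/dvc-storage
```

The `-d` flag is what sets `myremote` as the default in the `core` section, so later `dvc push`/`dvc pull` calls need no extra arguments.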
+ +```dvc +$ dvc push +``` + +Usually, we also want to `git commit` and `git push` the corresponding +[DVC-files](/doc/user-guide/dvc-file-format). + +
+ +### Expand to see what happened internally + +`dvc push` copied the data cached locally to the remote storage we +set up earlier. You can check that the data has been stored in the DVC remote +with: + +```dvc +$ ls -R /tmp/dvc-storage +... +/tmp/dvc-storage/55: +8a00881d4a6815ba625c13e27c5b7e +/tmp/dvc-storage/a3: +04afb96060aad90176268345e10355 +``` + +Note that both versions of the data are stored. (This should match +`.dvc/cache`.) + +
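The "should match" note above can be verified mechanically: the cache and the remote use the same hash-addressed layout, so their relative file listings should be equal. A small self-contained sketch, with throwaway directories standing in for `.dvc/cache` and `/tmp/dvc-storage`:

```python
import os
import tempfile

def listing(root):
    """Relative paths of every file under `root`, e.g. 'a3/04afb...'."""
    return {
        os.path.relpath(os.path.join(base, name), root)
        for base, _, files in os.walk(root)
        for name in files
    }

# Toy cache and remote holding the same hash-named entry:
cache, remote = tempfile.mkdtemp(), tempfile.mkdtemp()
for root in (cache, remote):
    os.makedirs(os.path.join(root, "a3"))
    open(os.path.join(root, "a3", "04afb96060aad90176268345e10355"), "w").close()

print(listing(cache) == listing(remote))  # True
```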
+
+## Retrieving
+
+Once DVC-tracked data is stored remotely, it can be **downloaded** when needed
+in other copies of this project with `dvc pull`. Usually, we run it
+after `git clone` and `git pull`.
+
+<details>
+
+### 👉 Expand to simulate a fresh clone of this repo
+
+Let's just remove the directory added so far, both from workspace
+and cache:
+
+```dvc
+$ rm -rf datadir .dvc/cache/a3/04afb96060aad90176268345e10355
+$ dvc status
+datadir.dvc:
+    changed outs:
+        deleted:            datadir
+```
+
+`dvc status` detects when DVC-tracked data is missing (among other situations).
+
+</details>
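The `deleted:` line above comes from exactly this kind of check: each output path recorded in a DVC-file is compared against what actually exists in the workspace. A minimal sketch (the helper name is hypothetical, not part of DVC's API):

```python
import os
import tempfile

def missing_outputs(recorded_outs, workspace):
    """Recorded output paths that no longer exist in the workspace."""
    return [
        path for path in recorded_outs
        if not os.path.exists(os.path.join(workspace, path))
    ]

# Demo: `datadir` is recorded in a DVC-file but absent from the workspace.
workspace = tempfile.mkdtemp()  # an empty stand-in workspace
print(missing_outputs(["datadir"], workspace))  # ['datadir']
```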
+
+```dvc
+$ dvc pull
+```
+
+> 📖 See also
+> [Sharing Data and Model Files](/doc/use-cases/sharing-data-and-model-files)
+> for more on basic collaboration workflows.
+
+## Other ways to track data
+
+In the [Pipelines](/doc/tutorials/get-started/data-pipelines) and
+[Access](/doc/tutorials/get-started/data-access) pages you'll learn more
+advanced ways to track data. Mainly, `dvc run` can track the intermediate and
+final results of complex data processes, and `dvc import` can bring in an
+artifact from an external DVC repository.
diff --git a/content/docs/tutorials/get-started/experiments.md b/content/docs/tutorials/get-started/experiments.md
index c99199c5e0..6035d19d2d 100644
--- a/content/docs/tutorials/get-started/experiments.md
+++ b/content/docs/tutorials/get-started/experiments.md
@@ -1,31 +1,71 @@
 # Experiments
 
-Data science process is inherently iterative and R&D like. Data scientist may
-try many different approaches, different hyperparameter values, and "fail" many
-times before the required level of a metric is achieved.
-
-DVC is built to provide a way to capture different experiments and navigate
-easily between them. Let's say we want to try a modified feature extraction:
+Each stage in a pipeline is like a specialized machine in a production line.
+Data scientists tend to tweak and configure them along the way, to improve the
+final results. DVC provides ways to control these experiments with
+[parameters](/doc/command-reference/params), compare their performance with
+[metrics](#project-metrics), and switch between them easily with Git.
 
 <details>
-### Expand to see code modifications
+### 👉 Expand to prepare the project
 
-Edit `src/featurization.py` to enable bigrams and increase the number of
-features. Find and change the `CountVectorizer` arguments, specify `ngram_range`
-and increase number of features:
+Get the sample project from GitHub with:
 
-```python
-bag_of_words = CountVectorizer(stop_words='english',
-                               max_features=6000,
-                               ngram_range=(1, 2))
+```dvc
+$ git clone https://github.com/iterative/example-get-started
+$ cd example-get-started
+$ git checkout '7-ml-pipeline'
+$ dvc pull
 ```
 
 </details>
+## Tuning parameters + +Let's say we want to try a modified feature extraction. The +`src/featurization.py` script used to +[create the pipeline](/doc/tutorials/get-started/data-pipelines#dependency-graphs-dags) +actually accepts an optional third argument with the path to a YAML _parameters +file_ to load values to tune its vectorization. Let's generate it: + +```dvc +$ echo "max_features: 6000" > params.yaml +$ echo "ngram_range:" >> params.yaml +$ echo " lo: 1" >> params.yaml +$ echo " hi: 2" >> params.yaml +$ git add params.yaml +``` + +> Notice that we're versioning our parameters file with Git, in case we want to +> change its contents for further experiments. + +Let's now redefine the featurization stage so that DVC knows that it depends on +the specific values of `max_features` and `ngram_range`. For this we use the +`-p` (`--params`) option of `dvc run`. `params.yaml` is the default parameters +file name in DVC, so there's no need to specify this: + +```dvc +$ dvc run -y -f featurize.dvc \ + -d src/featurization.py -d data/prepared \ + -p max_features,ngram_range.lo,ngram_range.hi \ + -o data/features \ + python src/featurization.py \ + data/prepared data/features params.yaml + +$ git add featurize.dvc +$ git commit -m "Update featurization stage" +``` + +> Please refer to `dvc params` for more information. + +### Run the experiment + +Let's [reproduce](/doc/tutorials/get-started/data-pipelines#reproduce) our +pipeline up to the model training now: + ```dvc -$ vi src/featurization.py # edit to use bigrams (see above) -$ dvc repro train.dvc # regenerate the new model.pkl +$ dvc repro train.dvc $ git commit -am "Reproduce model using bigrams" ``` @@ -34,11 +74,13 @@ $ git commit -am "Reproduce model using bigrams" > [command reference](https://git-scm.com/docs/git-commit#Documentation/git-commit.txt--a) > for more details. +--- + Now, we have a new `model.pkl` captured and saved. 
To get back to the initial version, we run `git checkout` along with the
`dvc checkout` command:
 
 ```dvc
-$ git checkout baseline-experiment
+$ git checkout 'baseline-experiment'
 $ dvc checkout
 ```
 
@@ -47,3 +89,85 @@
 your workspace almost instantly on almost all modern operating systems with
 file links. See
 [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization) for
 more information.
+
+## Project metrics
+
+DVC metrics allow us to mark process outputs as files containing metrics to
+track. They are defined using the `-m` (`--metrics`) option of `dvc run`.
+
+Let's add a final evaluation stage to our
+[pipeline](/doc/tutorials/get-started/data-pipelines#dependency-graphs-dags):
+
+```dvc
+$ dvc run -f evaluate.dvc \
+          -d src/evaluate.py -d model.pkl -d data/features \
+          -M auc.json \
+          python src/evaluate.py model.pkl \
+                 data/features auc.json
+```
+
+`evaluate.py` reads features from the `features/test.pkl` file and calculates
+the model's
+[AUC](https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5)
+value. This metric is written to the `auc.json` file. We use the `-M` option in
+the command above to mark the file as a metric in the stage file.
+
+> Please refer to `dvc run` and `dvc metrics` documentation for more details.
+
+Let's save the updates:
+
+```dvc
+$ git add evaluate.dvc auc.json
+$ git commit -m "Model evaluation stage"
+```
+
+> Notice that we are versioning `auc.json` with Git directly.
+
+Let's also assign a Git tag. It will serve as a checkpoint for us to compare
+experiments later:
+
+```dvc
+$ git tag -a "baseline-experiment" -m "Baseline experiment evaluation"
+```
+
+## Compare experiments
+
+DVC makes it easy to iterate on your project using Git commits with tags or Git
+branches. It provides a way to try different ideas, keep track of them, and
+switch back and forth.
To find the best-performing experiment or track the progress,
+[project metrics](/doc/command-reference/metrics) are supported in DVC (as
+described in one of the previous sections).
+
+Let's run the evaluation for the latest `bigrams` experiment we created
+earlier. It mostly just takes running `dvc repro`:
+
+```dvc
+$ git checkout master
+$ dvc checkout
+$ dvc repro evaluate.dvc
+```
+
+`git checkout master` and `dvc checkout` commands ensure that we have the latest
+experiment code and data respectively. And `dvc repro` is a way to run all the
+necessary commands to build the model and measure its performance.
+
+```dvc
+$ git commit -am "Evaluate bigrams model"
+$ git tag -a "bigrams-experiment" -m "Bigrams experiment evaluation"
+```
+
+Now, we can use the `-T` option of the `dvc metrics show` command to see the
+difference between the `baseline` and `bigrams` experiments:
+
+```dvc
+$ dvc metrics show -T
+
+baseline-experiment:
+    auc.json: {"AUC": 0.588426}
+bigrams-experiment:
+    auc.json: {"AUC": 0.602818}
+```
+
+DVC provides built-in support to track and navigate `JSON` or `YAML` metric
+files if you want to track additional information. See `dvc metrics` to learn
+more.
diff --git a/content/docs/tutorials/get-started/import-data.md b/content/docs/tutorials/get-started/import-data.md
deleted file mode 100644
index 6900533d5c..0000000000
--- a/content/docs/tutorials/get-started/import-data.md
+++ /dev/null
@@ -1,87 +0,0 @@
-# Import Data
-
-We've seen how to [push](/doc/tutorials/get-started/store-data) and
-[pull](/doc/tutorials/get-started/retrieve-data) data from/to a DVC
-project's [remote](/doc/command-reference/remote). But what if we wanted
-to integrate a dataset or ML model produced in one project into another one?
-
-One way is to manually download the data (with `wget` or `dvc get`, for example)
-and use `dvc add` to track it, but the connection between the projects would be
-lost.
We wouldn't be able to tell where the data came from or whether there are -new versions available. A better alternative is the `dvc import` command: - - - -```dvc -$ dvc import https://github.com/iterative/dataset-registry \ - get-started/data.xml -``` - -This downloads `data.xml` from our -[dataset-registry](https://github.com/iterative/dataset-registry) project into -the current working directory, adds it to `.gitignore`, and creates the -`data.xml.dvc` [DVC-file](/doc/user-guide/dvc-file-format) to track changes in -the source data. With _imports_, we can use `dvc update` to bring in changes in -the external data source before -[reproducing](/doc/tutorials/get-started/reproduce) any pipeline -that depends on this data. - -
- -### Expand to learn more about imports - -Note that the [dataset-registry](https://github.com/iterative/dataset-registry) -repository doesn't actually contain a `get-started/data.xml` file. Instead, DVC -inspects -[get-started/data.xml.dvc](https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc) -and tries to retrieve the file using the project's default remote (configured -[here](https://github.com/iterative/dataset-registry/blob/master/.dvc/config)). - -DVC-files created by `dvc import` are called _import stages_. They use the -`repo` field in the dependencies section (`deps`) in order to track source data -changes (as an [external dependency](/doc/user-guide/external-dependencies)), -enabling the reusability of data artifacts. For example: - -```yaml -md5: fd56a1794c147fea48d408f2bc95a33a -locked: true -deps: - - path: get-started/data.xml - repo: - url: https://github.com/iterative/dataset-registry - rev_lock: 7476a858f6200864b5755863c729bff41d0fb045 -outs: - - md5: a304afb96060aad90176268345e10355 - path: data.xml - cache: true - metric: false - persist: false -``` - -The `url` and `rev_lock` subfields under `repo` are used to save the origin and -[version](https://git-scm.com/docs/revisions) of the dependency, respectively. - -> Note that `dvc update` updates the `rev_lock` field of the corresponding -> DVC-file (when there are changes to bring in). - -
-
-Since this is not an official part of this _Get Started_, bring everything back
-to normal with:
-
-```dvc
-$ git reset --hard
-$ rm -f data.*
-```
-
-> See also `dvc import-url`.
diff --git a/content/docs/tutorials/get-started/index.md b/content/docs/tutorials/get-started/index.md
new file mode 100644
index 0000000000..2cbede0efb
--- /dev/null
+++ b/content/docs/tutorials/get-started/index.md
@@ -0,0 +1,51 @@
+# Get Started with DVC!
+
+Data Version Control, or DVC, is a data versioning, data pipelining, and
+experiment management command-line tool built on top of an existing engineering
+toolset and practices, particularly Git. In this guide we show the basic
+features of DVC step by step.
+
+## Initialize
+
+Move into the directory you want to use as your workspace, and use `dvc init`
+inside to create a DVC project. It can contain existing
+project files. At initialization, a new `.dvc/` directory is created for the
+internal [files and directories](/doc/user-guide/dvc-files-and-directories):
+
+```dvc
+$ dvc init
+$ ls .dvc/
+config plots/ tmp/
+```
+
+DVC is typically initialized on top of Git, which is needed for the
+[versioning](/doc/tutorials/get-started/data-versioning) features. The `.dvc/`
+directory is automatically staged with Git by `dvc init`, so it can be committed
+right away:
+
+```dvc
+$ git status
+Changes to be committed:
+    new file:   .dvc/.gitignore
+    new file:   .dvc/config
+    ...
+$ git commit -m "Initialize DVC repo"
+```
+
+## What's ahead?
+
+DVC functionality can be split into layers. Each one can be used independently,
+but together they form a robust framework to capture and navigate machine
+learning development.
+
+- [Data **versioning**](/doc/tutorials/get-started/data-versioning) is the basic
+  foundation for storing and sharing **evolving datasets** and ML models. We
+  keep our regular Git workflow, without storing **large files** in Git.
+ +- [Data **pipelines**](/doc/tutorials/get-started/data-pipelines) let you + register data modeling **workflows** that can be managed and **reproduced** + easily by you or others. + +- [**Experiments**](/doc/tutorials/get-started/experiments) are a natural part + of data science, or any R&D process. DVC provides special tools to define, + manage, tune, and **compare them** through _parameters_ and _metrics_. diff --git a/content/docs/tutorials/get-started/initialize.md b/content/docs/tutorials/get-started/initialize.md deleted file mode 100644 index 1e227d96c9..0000000000 --- a/content/docs/tutorials/get-started/initialize.md +++ /dev/null @@ -1,29 +0,0 @@ -# Initialize - -There are a few recommended ways to install DVC: OS-specific package/installer, -`pip`, `conda`, and Homebrew. See [Installation](/doc/install) for all the -alternatives and details. - -Let's start by creating a workspace we can version with Git. Then -run `dvc init` inside to create the DVC project: - -```dvc -$ mkdir example-get-started -$ cd example-get-started -$ git init -$ dvc init -$ git commit -m "Initialize DVC project" -``` - -At DVC initialization, a new `.dvc/` directory will be created for internal -configuration and cache -[files and directories](/doc/user-guide/dvc-files-and-directories) that are -hidden from the user. - -> See `dvc init` if you want to get more details about the initialization -> process, and -> [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to -> learn about the DVC internal file and directory structure. - -The last command, `git commit`, versions the `.dvc/config` and `.dvc/.gitignore` -files (DVC internals) with Git. 
diff --git a/content/docs/tutorials/get-started/metrics.md b/content/docs/tutorials/get-started/metrics.md deleted file mode 100644 index e91ba6371f..0000000000 --- a/content/docs/tutorials/get-started/metrics.md +++ /dev/null @@ -1,45 +0,0 @@ -# Experiment Metrics - -Finally, we'd like to add an evaluation stage to our -[pipeline](/doc/command-reference/pipeline). Data science is a metric-driven -R&D-like process and `dvc metrics` commands along with DVC metric files provide -a framework to capture and compare experiments performance. It doesn't require -installing any databases or instrumenting your code to use some API, all is -tracked by Git and is stored in Git or DVC remote storage: - -```dvc -$ dvc run -f evaluate.dvc \ - -d src/evaluate.py -d model.pkl -d data/features \ - -M auc.metric \ - python src/evaluate.py model.pkl \ - data/features auc.metric -``` - -`evaluate.py` calculates AUC value using the test dataset. It reads features -from the `features/test.pkl` file and produces a -[metric](/doc/command-reference/metrics) file (`auc.metric`). Any -output (in this case just a plain text file containing a single -numeric value) can be marked as a metric, for example by using the `-M` option -of `dvc run`. - -> Please, refer to the `dvc metrics` command documentation to see more details. - -Let's save the updated results: - -```dvc -$ git add evaluate.dvc auc.metric -$ git commit -m "Create evaluation stage" -$ dvc push -``` - -Let's also assign a Git tag, it will serve as a checkpoint for us to compare -experiments in the future, or if we need to go back and checkout it and the -corresponding data: - -```dvc -$ git tag -a "baseline-experiment" -m "Baseline experiment evaluation" -``` - -The `dvc metrics show` command provides a way to compare different experiments, -by analyzing metric files across different branches, tags, etc. But first we -need to create a new experiment to compare the baseline with. 
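The evaluation stage above relies on `evaluate.py` to compute an AUC value and write it to `auc.metric`, which the `-M` option then registers as a metric. As a rough, hypothetical sketch of what such a script might do (this is not the actual `evaluate.py` from the example project, which loads `model.pkl` and the test features), a rank-based AUC can be computed with the standard library alone:

```python
# Hypothetical sketch of an evaluation script like evaluate.py: compute a
# rank-based AUC (Mann-Whitney formulation, no tie handling) and write it to
# a plain-text metric file that `dvc run -M` can track. Illustrative only.

def auc(labels, scores):
    """AUC via the rank-sum of positive examples among all scored examples."""
    ranked = sorted(zip(scores, labels))  # ascending by predicted score
    rank_sum = sum(rank for rank, (_, label) in enumerate(ranked, start=1)
                   if label == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

if __name__ == "__main__":
    labels = [0, 0, 1, 1]           # ground truth from the test set
    scores = [0.1, 0.4, 0.35, 0.8]  # model predictions
    # Write the metric as plain text, the format expected by -M above.
    with open("auc.metric", "w") as fobj:
        fobj.write("{:.6f}\n".format(auc(labels, scores)))
```

Because the stage declares `auc.metric` with `-M`, the file stays tracked by Git, and `dvc metrics show` can later compare its value across commits, tags, or branches.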
diff --git a/content/docs/tutorials/get-started/older-versions.md b/content/docs/tutorials/get-started/older-versions.md deleted file mode 100644 index bde6bce562..0000000000 --- a/content/docs/tutorials/get-started/older-versions.md +++ /dev/null @@ -1,53 +0,0 @@ -# Get Older Data Version - -Now that we have multiple experiments, models, processed datasets, the question -is how do we revert back to an older version of a model file? Or how can we get -the previous version of the dataset if it was changed at some point? - -The answer is the `dvc checkout` command, and we already touched briefly the -process of switching between different data versions in the -[Experiments](/doc/tutorials/get-started/experiments) chapter of this _Get -Started_ section. - -Let's say we want to get the previous `model.pkl` file. The short answer is: - -```dvc -$ git checkout baseline-experiment train.dvc -$ dvc checkout train.dvc -``` - -These two commands will bring the previous model file to its place in the -workspace. - -
- -### Expand to learn about DVC internals - -DVC uses special [DVC-files](/doc/user-guide/dvc-file-format) to track data -files, directories, end results. In this case, `train.dvc` among other things -describes the `model.pkl` file this way: - -```yaml -outs: -md5: a66489653d1b6a8ba989799367b32c43 -path: model.pkl -``` - -`a664...2c43` is the "address" of the file in the local or remote DVC storage. - -It means that if we want to get to the previous version, we need to restore the -DVC-file first with the `git checkout` command. Only after that can DVC restore -the model file using the new "address" from the DVC-file. - -
- -To fully restore the previous experiment we just run `git checkout` and -`dvc checkout` without specifying a target: - -```dvc -$ git checkout baseline-experiment -$ dvc checkout -``` - -Read the `dvc checkout` command reference and a dedicated data versioning -[example](/doc/tutorials/versioning) for more information. diff --git a/content/docs/tutorials/get-started/pipeline.md b/content/docs/tutorials/get-started/pipeline.md deleted file mode 100644 index d9f0f19390..0000000000 --- a/content/docs/tutorials/get-started/pipeline.md +++ /dev/null @@ -1,44 +0,0 @@ -# Pipeline - -Support for [pipelines](/doc/command-reference/pipeline) is the biggest -difference between DVC and other version control tools that can handle large -data files (e.g. `git lfs`). By using `dvc run` multiple times, and specifying -outputs of a command (stage) as dependencies in another one, we can describe a -sequence of commands that gets to a desired result. This is what we call a -**data pipeline** or dependency graph. - -Let's create a second stage (after `prepare.dvc`, created in the previous -chapter) to perform feature extraction: - -```dvc -$ dvc run -f featurize.dvc \ - -d src/featurization.py -d data/prepared \ - -o data/features \ - python src/featurization.py \ - data/prepared data/features -``` - -And a third stage for training: - -```dvc -$ dvc run -f train.dvc \ - -d src/train.py -d data/features \ - -o model.pkl \ - python src/train.py data/features model.pkl -``` - -Let's commit DVC-files that describe our pipeline so far: - -```dvc -$ git add data/.gitignore .gitignore featurize.dvc train.dvc -$ git commit -m "Create featurization and training stages" -$ dvc push -``` - -This example is simplified just to show you a basic pipeline, see a more -advanced [example](/doc/tutorials/pipelines) or -[complete tutorial](/doc/tutorials/pipelines) to create an -[NLP](https://en.wikipedia.org/wiki/Natural_language_processing) pipeline -end-to-end. 
- -> See also the `dvc pipeline` command. diff --git a/content/docs/tutorials/get-started/reproduce.md b/content/docs/tutorials/get-started/reproduce.md deleted file mode 100644 index d6e6375878..0000000000 --- a/content/docs/tutorials/get-started/reproduce.md +++ /dev/null @@ -1,39 +0,0 @@ -# Reproduce - -In the previous chapters, we described our first -[pipeline](/doc/command-reference/pipeline). Basically, we generated a number of -[stage files](/doc/command-reference/run) -([DVC-files](/doc/user-guide/dvc-file-format)). These stages define individual -commands to execute towards a final result. Each depends on some data (either -raw data files or intermediate results from previous stages) and code files. - -If you just cloned the -[project](https://github.com/iterative/example-get-started), make sure you first -fetch the input data from DVC by calling `dvc pull`. - -It's now extremely easy for you or your colleagues to reproduce the result -end-to-end: - -```dvc -$ dvc repro train.dvc -``` - -> If you've just followed the previous chapters, the command above will have -> nothing to reproduce since you've recently executed all the pipeline stages. -> To easily try this command, clone this example -> [GitHub project](https://github.com/iterative/example-get-started) and run it -> from there. - -`train.dvc` describes which source code and data files to use, and how to run -the command in order to get the resulting model file. For each data file it -depends on, we can in turn do the same analysis: find a corresponding DVC-file -that includes the data file in its outputs, get dependencies and commands, and -so on. It means that DVC can recursively build a complete sequence of commands -it needs to execute to get the model file. - -`dvc repro` essentially builds a dependency graph, detects stages with modified -dependencies or missing outputs and recursively executes commands (nodes in this -graph or pipeline) starting from the first stage with changes. 
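The recursive behavior described above can be pictured with a small sketch. This is a simplification of the idea, not DVC's actual implementation (which also compares file checksums recorded in DVC-files): visit upstream stages first, and re-run a stage when it changed or when anything upstream had to re-run.

```python
# Toy model of the `dvc repro` idea (not DVC's real code): each stage lists
# the stages it depends on; a stage must re-run if it is marked as changed,
# or if any upstream stage had to re-run.

def repro(target, deps, changed, executed):
    """Post-order walk; append every stage that must run to `executed`."""
    ran = False
    for dep in deps.get(target, []):
        if repro(dep, deps, changed, executed):  # rebuild upstream first
            ran = True
    if ran or target in changed:
        executed.append(target)  # the stage's command would run here
        ran = True
    return ran  # tells downstream stages that their input was rebuilt

# Pipeline from the previous chapters: data -> prepare -> featurize -> train
deps = {
    "train.dvc": ["featurize.dvc"],
    "featurize.dvc": ["prepare.dvc"],
    "prepare.dvc": ["data/data.xml.dvc"],
}
plan = []
repro("train.dvc", deps, changed={"featurize.dvc"}, executed=plan)
print(plan)  # only the changed stage and everything downstream of it
```

Note how only the changed stage and its downstream stages appear in the plan, which is exactly why `dvc repro` can skip stages whose dependencies are unchanged.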
- -Thus, `dvc run` and `dvc repro` provide a powerful framework for _reproducible -experiments_ and _reproducible projects_. diff --git a/content/docs/tutorials/get-started/retrieve-data.md b/content/docs/tutorials/get-started/retrieve-data.md deleted file mode 100644 index 2a11926903..0000000000 --- a/content/docs/tutorials/get-started/retrieve-data.md +++ /dev/null @@ -1,29 +0,0 @@ -# Retrieve Data - -> You'll need to complete the -> [initialization](/doc/tutorials/get-started/initialize) and -> [configuration](/doc/tutorials/get-started/configure) chapters before being -> able to run the commands explained here. - -To retrieve data files into the workspace in your local machine, -run: - -```dvc -$ rm -f data/data.xml -$ dvc pull -``` - -This command downloads data files that are referenced in all -[DVC-files](/doc/user-guide/dvc-file-format) in the project. So, -you usually run it after `git clone`, `git pull`, or `git checkout`. - -Alternatively, if you want to retrieve a single dataset or a file you can use: - -```dvc -$ dvc pull data/data.xml.dvc -``` - -DVC remotes, `dvc push`, and `dvc pull` provide a basic collaboration workflow, -the same way as Git remotes, `git push` and `git pull`. See -[Sharing Data and Model Files](/doc/use-cases/sharing-data-and-model-files) for -more information. diff --git a/content/docs/tutorials/get-started/store-data.md b/content/docs/tutorials/get-started/store-data.md deleted file mode 100644 index 1306681e27..0000000000 --- a/content/docs/tutorials/get-started/store-data.md +++ /dev/null @@ -1,44 +0,0 @@ -# Store and Share Data - -Now, that your data files are managed by DVC (see -[Add Files](/doc/tutorials/get-started/add-files)), you can push them from your -repository to the default [remote](/doc/command-reference/remote) storage\*: - -```dvc -$ dvc push -``` - -The same way as with Git remote, it ensures that your data files and your models -are safely stored remotely and are shareable. 
This means that the data can be -pulled by yourself or your colleagues whenever you need it. - -Usually, you run it along with `git commit` and `git push` to save the changed -[DVC-files](/doc/user-guide/dvc-file-format). - -The `dvc push` command allows one to upload data to remote storage. It doesn't -save any changes in the code or DVC-files. Those should be saved by using -`git commit` and `git push`. - -> \*As noted in the DVC [configuration](/doc/tutorials/get-started/configure) -> chapter, we are using a **local remote** in this section for illustrative -> purposes. - -
- -### Expand to learn more about DVC internals - -You can check now that actual data file has been copied to the remote we created -in the [configuration](/doc/tutorials/get-started/configure) chapter: - -```dvc -$ ls -R /tmp/dvc-storage -/tmp/dvc-storage/a3: -04afb96060aad90176268345e10355 -``` - -`a304afb96060aad90176268345e10355` above is the hash value of the `data.xml` -file. If you check the `data.xml.dvc` -[DVC-file](/doc/user-guide/dvc-file-format), you will see that it has this -string inside. - -
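The storage layout shown above follows a simple content-addressing convention: the file's MD5 hash, with the first two characters used as a directory name, so `a304afb...` is stored as `a3/04afb...`. A minimal standard-library sketch of that convention (an illustration of the addressing scheme, not DVC's actual cache code):

```python
import hashlib
import os

def cache_path(data, storage_root):
    """DVC-style cache address for some file contents:
    <root>/<first 2 hash chars>/<remaining 30 chars>."""
    md5 = hashlib.md5(data).hexdigest()
    return os.path.join(storage_root, md5[:2], md5[2:])

# The same contents always map to the same address in the cache or remote,
# which is how `dvc push`/`dvc pull` can deduplicate data.
print(cache_path(b"hello", "/tmp/dvc-storage"))
# /tmp/dvc-storage/5d/41402abc4b2a76b9719d911017c592
```

Because the address depends only on the contents, pushing the same file twice never uploads or stores it twice.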
diff --git a/content/docs/tutorials/get-started/visualize.md b/content/docs/tutorials/get-started/visualize.md deleted file mode 100644 index 5b7e5c293f..0000000000 --- a/content/docs/tutorials/get-started/visualize.md +++ /dev/null @@ -1,84 +0,0 @@ -# Visualize - -Now that we have built our pipeline, we need a good way to visualize it to be -able to wrap our heads around it. Luckily, DVC allows us to do that without -leaving the terminal, making the experience distraction-less. - -We are using the `--ascii` option below to better illustrate this pipeline. -Please, refer to `dvc pipeline show` to explore other options this command -supports (e.g. `.dot` files that can be used then in other tools). - -## Stages - -```dvc -$ dvc pipeline show --ascii train.dvc - +-------------------+ - | data/data.xml.dvc | - +-------------------+ - * - * - * - +-------------+ - | prepare.dvc | - +-------------+ - * - * - * - +---------------+ - | featurize.dvc | - +---------------+ - * - * - * - +-----------+ - | train.dvc | - +-----------+ -``` - -## Commands - -```dvc -$ dvc pipeline show --ascii train.dvc --commands - +-------------------------------------+ - | python src/prepare.py data/data.xml | - +-------------------------------------+ - * - * - * - +---------------------------------------------------------+ - | python src/featurization.py data/prepared data/features | - +---------------------------------------------------------+ - * - * - * - +---------------------------------------------+ - | python src/train.py data/features model.pkl | - +---------------------------------------------+ -``` - -## Outputs - -```dvc -$ dvc pipeline show --ascii train.dvc --outs - +---------------+ - | data/data.xml | - +---------------+ - * - * - * - +---------------+ - | data/prepared | - +---------------+ - * - * - * - +---------------+ - | data/features | - +---------------+ - * - * - * - +-----------+ - | model.pkl | - +-----------+ -``` diff --git a/content/docs/tutorials/pipelines.md 
b/content/docs/tutorials/pipelines.md index 537351863e..5283c3ea2c 100644 --- a/content/docs/tutorials/pipelines.md +++ b/content/docs/tutorials/pipelines.md @@ -29,7 +29,7 @@ and reproducible way. > We have tested our tutorials and examples with Python 3. We don't recommend > using earlier versions. -You'll need [Git](https://git-scm.com) to run the commands in this tutorial. +You'll need [Git](https://git-scm.com/) to run the commands in this tutorial. Also, if DVC is not installed, please follow these [instructions](/doc/install) to do so. @@ -50,13 +50,13 @@ $ git add code/ $ git commit -m "Download and add code to new Git repo" ``` -> `dvc get` can use any DVC repository to find the appropriate -> [remote storage](/doc/command-reference/remote) and download data -> artifacts from it (analogous to `wget`, but for repositories). In this -> case we use [dataset-registry](https://github.com/iterative/dataset-registry)) -> as the source repo. (Refer to -> [Data Registries](/doc/use-cases/data-registries) for more info about this -> setup.) +> `dvc get` can download any data artifact tracked in a DVC +> repository, using the appropriate +> [remote storage](/doc/command-reference/remote). It's like `wget`, but for DVC +> or Git repos. In this case we use our +> [dataset registry](https://github.com/iterative/dataset-registry) repo as the +> data source (refer to [Data Registries](/doc/use-cases/data-registries) for +> more info.) Now let's install the requirements. But before we do that, we **strongly** recommend creating a @@ -70,8 +70,9 @@ $ pip install -r code/requirements.txt ``` Next, we will create a [pipeline](/doc/command-reference/pipeline) step-by-step, -utilizing the same set of commands that are described in earlier -[Get Started](/doc/tutorials/get-started) chapters. +utilizing the same set of commands that are described in the +[Data Pipelines](/doc/tutorials/get-started/data-pipelines) page of the _Get +Started_. 
> Note that its possible to define more than one pipeline in each DVC project. > This will be determined by the interdependencies between DVC-files, mentioned diff --git a/content/docs/tutorials/versioning.md b/content/docs/tutorials/versioning.md index 06eb3a8624..ede422613b 100644 --- a/content/docs/tutorials/versioning.md +++ b/content/docs/tutorials/versioning.md @@ -29,7 +29,7 @@ model file. > We have tested our tutorials and examples with Python 3. We don't recommend > using earlier versions. -You'll need [Git](https://git-scm.com) to run the commands in this tutorial. +You'll need [Git](https://git-scm.com/) to run the commands in this tutorial. Also, if DVC is not installed, please follow these [instructions](/doc/install) to do so. @@ -83,13 +83,13 @@ $ unzip -q data.zip $ rm -f data.zip ``` -> `dvc get` can use any DVC repository to find the appropriate -> [remote storage](/doc/command-reference/remote) and download data -> artifacts from it (analogous to `wget`, but for repositories). In this -> case we use [dataset-registry](https://github.com/iterative/dataset-registry)) -> as the source repo. (Refer to -> [Data Registries](/doc/use-cases/data-registries) for more info about this -> setup.) +> `dvc get` can download any data artifact tracked in a DVC +> repository, using the appropriate +> [remote storage](/doc/command-reference/remote). It's like `wget`, but for DVC +> or Git repos. In this case we use our +> [dataset registry](https://github.com/iterative/dataset-registry) repo as the +> data source (refer to [Data Registries](/doc/use-cases/data-registries) for +> more info.) This command downloads and extracts our raw dataset, consisting of 1000 labeled images for training and 800 labeled images for validation. In total, it's a 43 @@ -370,5 +370,6 @@ Another detail we only brushed upon here is the way we captured the `metrics.csv` metric file with the `-M` option of `dvc run`. 
Marking this output as a metric enables us to compare its values across Git
tags or branches (for example, representing different experiments). See `dvc metrics`
-and [Compare Experiments](/doc/tutorials/get-started/compare-experiments) to
-learn more about managing metrics with DVC.
+and
+[Compare Experiments](/doc/tutorials/get-started/experiments#compare-experiments)
+to learn more about managing metrics with DVC.
diff --git a/content/docs/understanding-dvc/related-technologies.md b/content/docs/understanding-dvc/related-technologies.md
index d2b3729e8b..34afd0e9b6 100644
--- a/content/docs/understanding-dvc/related-technologies.md
+++ b/content/docs/understanding-dvc/related-technologies.md
@@ -129,13 +129,13 @@ Luigi, etc.
 
 - `git-lfs` was not made with data science scenarios in mind, so it does not
   provide related features (e.g. pipelines,
   [metrics](/doc/command-reference/metrics)), and thus GitHub has a limit of 2
   GB per repository.
 
 ---
 
-> \***copy-on-write links or "reflinks"** are a relatively new way to link files
-> in UNIX-style file systems. Unlike hardlinks or symlinks, they support
+> \* **copy-on-write links or "reflinks"** are a relatively new way to link
+> files in UNIX-style file systems. Unlike hardlinks or symlinks, they support
 > transparent [copy on write](https://en.wikipedia.org/wiki/Copy-on-write). This
 > means that editing a reflinked file is always safe as all the other links to
 > the file will reflect the changes.
diff --git a/content/docs/understanding-dvc/what-is-dvc.md b/content/docs/understanding-dvc/what-is-dvc.md
index 39aab4b8e9..56865ef8c8 100644
--- a/content/docs/understanding-dvc/what-is-dvc.md
+++ b/content/docs/understanding-dvc/what-is-dvc.md
@@ -1,16 +1,15 @@
 # What Is DVC?
Data Version Control, or DVC, is **a new type of experiment management
-software** that has been built **on top of the existing engineering toolset that
-you're already used to**, and particularly on a source code version control
-system (currently Git). DVC reduces the gap between existing tools and data
-science needs, allowing users to take advantage of experiment management
-software while reusing existing skills and intuition.
-
-The underlying source code control system eliminates the need to use external
-services. Data science experiment sharing and collaboration can be done through
-regular Git tools (commit messages, merges, pull requests, etc) the same way it
-works for software engineers.
+software** built on top of the existing engineering toolset that you're already
+used to, and particularly on a source code management system (Git). DVC reduces
+the gap between existing tools and data science needs, allowing users to take
+advantage of experiment management while reusing existing skills and intuition.
+
+Leveraging an underlying source code management system eliminates the need to
+use external services. Data science experiment sharing and collaboration can be
+done through regular Git features (commit messages, merges, pull requests, etc.)
+the same way it works for software engineers.
 
 DVC implements a **Git experimentation methodology** where each experiment
 exists with its code as well as data, and can be represented as a separate Git
diff --git a/content/docs/use-cases/data-registries.md b/content/docs/use-cases/data-registries.md
index e585a0095b..4b66691b5f 100644
--- a/content/docs/use-cases/data-registries.md
+++ b/content/docs/use-cases/data-registries.md
@@ -48,7 +48,7 @@ Data registries can be created like any other DVC repository with `git init` and
 `dvc init`. A good way to organize them is with different directories, to group
 the data into separate uses, such as `images/`, `natural-language/`, etc.
For example, our -[dataset-registry](https://github.com/iterative/dataset-registry) uses a +[dataset registry](https://github.com/iterative/dataset-registry) uses a directory for each part in our docs, like `get-started/`, `use-cases/`, etc. Adding datasets to a registry can be as simple as placing the data file or diff --git a/content/docs/use-cases/index.md b/content/docs/use-cases/index.md index 0f522a8326..7640d75cb4 100644 --- a/content/docs/use-cases/index.md +++ b/content/docs/use-cases/index.md @@ -12,5 +12,21 @@ first. > in the community. Please, [contact us](/support) if you need help or have > suggestions! -Our use cases range from basic to more advanced. Please choose from the +## Basic uses + +If you store and process data files or datasets to produce other data or machine +learning models, and you want to + +- capture and save data artifacts the same way you capture code; +- track and switch between different versions of data or models easily; +- understand how data or models were built in the first place; +- be able to compare models and metrics to each other; +- bring software engineering best practices to your data science team; +- among other [use cases](/doc/use-cases) + +DVC is for you! + +--- + +Our use case pages range from basic to more advanced. Please choose from the navigation sidebar to the left, or click the `Next` button below โ†˜ diff --git a/content/docs/user-guide/basic-concepts/data-artifact.md b/content/docs/user-guide/basic-concepts/data-artifact.md index 2226b3ef06..2926790f7a 100644 --- a/content/docs/user-guide/basic-concepts/data-artifact.md +++ b/content/docs/user-guide/basic-concepts/data-artifact.md @@ -1,8 +1,8 @@ --- name: 'Data Artifact' -match: ['data artifact', 'data artifacts'] +match: ['data artifact', 'data artifacts', 'artifact', 'artifacts'] --- Any data file or directory, as well as intermediate or final result that is -tracked by DVC, for example by using `dvc add`. 
See
[Versioning Data and Models](/doc/use-cases/versioning-data-and-model-files).
diff --git a/content/docs/user-guide/basic-concepts/dvc-project.md b/content/docs/user-guide/basic-concepts/dvc-project.md
index 486e994379..ed977e8624 100644
--- a/content/docs/user-guide/basic-concepts/dvc-project.md
+++ b/content/docs/user-guide/basic-concepts/dvc-project.md
@@ -13,7 +13,7 @@ match:
 ]
 ---
 
-Initialized by running `dvc init` in the **workspace** (typically in a Git
+Initialized by running `dvc init` in the **workspace** (typically a Git
 repository). It will contain the
 [`.dvc/` directory](/doc/user-guide/dvc-files-and-directories) and
 [DVC-files](/doc/user-guide/dvc-file-format) created with commands such as
diff --git a/content/docs/user-guide/basic-concepts/workspace.md b/content/docs/user-guide/basic-concepts/workspace.md
index 99ad8d8193..76621c5e0c 100644
--- a/content/docs/user-guide/basic-concepts/workspace.md
+++ b/content/docs/user-guide/basic-concepts/workspace.md
@@ -3,5 +3,6 @@ name: Workspace
 match: [workspace]
 ---
 
-Directory containing all your project files, for example raw datasets, source
-code, ML models, etc. It will contain your DVC project.
+Directory containing all your project files, e.g. raw datasets, source code, ML
+models, etc. Typically, it's also a Git repository. It will contain your DVC
+project.
diff --git a/content/docs/user-guide/contributing/core.md b/content/docs/user-guide/contributing/core.md
index 899fcd06c1..4389353ba4 100644
--- a/content/docs/user-guide/contributing/core.md
+++ b/content/docs/user-guide/contributing/core.md
@@ -320,7 +320,7 @@ Format:
 
 (long description)
 
 Fixes #(GitHub issue id).
```

Message types:
@@ -329,7 +329,7 @@ Message types:
 
 - _short description_: Short description of the patch
 - _long description_: If needed, longer message describing the patch in more
   details
 - _github issue id_: ID of the GitHub issue that this patch is addressing
 
 Example:
diff --git a/content/docs/user-guide/contributing/docs.md b/content/docs/user-guide/contributing/docs.md
index c7f6acb131..d15ef180ab 100644
--- a/content/docs/user-guide/contributing/docs.md
+++ b/content/docs/user-guide/contributing/docs.md
@@ -31,7 +31,7 @@ to update the docs and redeploy the website.
 
 In case of a minor change, you can use the **Edit on GitHub** button (found to
 the right of each page) to fork the repository, edit it in place (with the
 source code file **Edit** button in GitHub), and create a pull request (PR).
 
 Otherwise, please refer to the following procedure:
@@ -197,9 +197,10 @@ We use **bold** text for emphasis, and _italics_ for special terms.
 
 We also use "emoji" symbols sparingly for visibility on certain notes. Mainly:
 
+- 📖 For notes that link to other related documentation
 - ⚠️ Warnings about possible problems related to DVC usage (similar to
   **Note!** and "Note that..." notes)
 - 💡 Useful tips related to external tools/integrations
 
-> Some other emojis currently in use here and there: ⚡🙏🐛⭐❗✅ (We're not
-> limited to these.)
+> Some other emojis currently in use here and there: ⚡✅🙏🐛⭐❗ (among
+> others).
diff --git a/content/docs/user-guide/large-dataset-optimization.md b/content/docs/user-guide/large-dataset-optimization.md
index 2862d7b47c..4c64803ce6 100644
--- a/content/docs/user-guide/large-dataset-optimization.md
+++ b/content/docs/user-guide/large-dataset-optimization.md
@@ -9,11 +9,11 @@ In order to track the data files and directories added with `dvc add` or
 details.)
However, the versions of the tracked files that -[match the current code](/doc/tutorials/get-started/connect-code-and-data) are -also needed in the workspace, so a subset of the cached files must -be kept in the working directory (using `dvc checkout`). Does this mean that -some files will be duplicated between the workspace and the cache? **That would -not be efficient!** Especially with large files (several Gigabytes or larger). +[match the current code](/doc/tutorials/get-started/data-pipelines) are also +needed in the workspace, so a subset of the cached files must be +kept in the working directory (using `dvc checkout`). Does this mean that some +files will be duplicated between the workspace and the cache? **That would not +be efficient!** Especially with large files (several Gigabytes or larger). In order to have the files present in both directories without duplication, DVC can automatically create **file links** to the cached data in the workspace. In @@ -127,8 +127,8 @@ To make sure that the data files in the workspace are consistent with the --- -> \***copy-on-write links or "reflinks"** are a relatively new way to link files -> in UNIX-style file systems. Unlike hardlinks or symlinks, they support +> \* **copy-on-write links or "reflinks"** are a relatively new way to link +> files in UNIX-style file systems. Unlike hardlinks or symlinks, they support > transparent [copy on write](https://en.wikipedia.org/wiki/Copy-on-write). This > means that editing a reflinked file is always safe as all the other links to > the file will reflect the changes. 
diff --git a/content/docs/user-guide/running-dvc-on-windows.md b/content/docs/user-guide/running-dvc-on-windows.md index 6de6ce6d56..e658eee649 100644 --- a/content/docs/user-guide/running-dvc-on-windows.md +++ b/content/docs/user-guide/running-dvc-on-windows.md @@ -16,12 +16,12 @@ perfect solution, bu here are some ideas: [Git for Windows](https://gitforwindows.org/)\* (Git Bash) among other [shell options](https://github.com/cmderdev/cmder/blob/master/README.md#access-to-multiple-shells-in-one-window-using-tabs). - [Anaconda Prompt](https://docs.anaconda.com/anaconda/user-guide/getting-started/#open-prompt-win) - is another recommendation, but it may not support all the desired CLI features - (e.g. `\` line continuation). + is another recommendation. - Consider enabling and using [WSL](https://blogs.windows.com/windowsdeveloper/2016/03/30/run-bash-on-ubuntu-on-windows/) ([Windows Terminal](https://devblogs.microsoft.com/commandline/) also - recommended). But it has major + recommended) which supports the most CLI features (e.g. `\` line + continuation). But it has major [I/O performance issues](https://www.phoronix.com/scan.php?page=article&item=windows10-okt-wsl&num=2) and is [unable to access GPUs](https://github.com/Microsoft/WSL/issues/829), et al.\* @@ -70,9 +70,9 @@ directory, as explained in ## Enabling paging with `less` By default, DVC tries to use [Less]() -as pager for the output of `dvc pipeline show`. Windows doesn't have the less -command available however. Fortunately, there is a easy way of installing `less` -via [Chocolatey](https://chocolatey.org/) (please install the tool first): +as pager for the output of `dvc pipeline show`. Windows doesn't have the `less` +command available however. 
Fortunately, there is an easy way of installing it via
+[Chocolatey](https://chocolatey.org/) (please install the tool first):
 
 ```dvc
 $ choco install less
diff --git a/content/docs/user-guide/updating-tracked-files.md b/content/docs/user-guide/updating-tracked-files.md
index 8202f7b6e9..eab994a62f 100644
--- a/content/docs/user-guide/updating-tracked-files.md
+++ b/content/docs/user-guide/updating-tracked-files.md
@@ -9,7 +9,7 @@ corruption when the DVC config option `cache.type` is set to `hardlink` or/and
 link types.)
 
 > For an example of the cache corruption problem see
 > [issue #599](https://github.com/iterative/dvc/issues/599) in our GitHub
 > repository.
 
 Assume `train.tsv` is tracked by dvc and you want to update it. Here updating