diff --git a/content/docs/start/data-management/data-pipelines.md b/content/docs/start/data-management/data-pipelines.md
index dd271b1b85..4bcf2b518b 100644
--- a/content/docs/start/data-management/data-pipelines.md
+++ b/content/docs/start/data-management/data-pipelines.md
@@ -14,38 +14,40 @@ https://youtu.be/71IGzyH95UY

-Versioning large data files and directories for data science is great, but not
-enough. How is data filtered, transformed, or used to train ML models? DVC
-introduces a mechanism to capture _data pipelines_ — series of data processes
-that produce a final result.
-
-DVC pipelines and their data can also be easily versioned (using Git). This
-allows you to better organize projects, and reproduce your workflow and results
-later — exactly as they were built originally! For example, you could capture a
-simple ETL workflow, organize a data science project, or build a detailed
-machine learning pipeline.
-
-Later on, we will find DVC manages the execution of
+Versioning large data files and directories for data science is powerful, but
+often not enough. Data needs to be filtered, cleaned, and transformed before
+training ML models. For that purpose, DVC introduces a build system to define,
+execute, and track _data pipelines_ — a series of data processing stages that
+produce a final result.
+
+**💫 DVC is a "Makefile" system for machine learning projects!**
+
+DVC pipelines are versioned using Git and allow you to better organize projects
+and reproduce complete workflows and results at will. You could capture a simple
+ETL workflow, organize your project, or build a complex DAG (Directed Acyclic
+Graph) pipeline.
+
+Later, we will find DVC allows you to manage
[machine learning experiments](/doc/start/experiments/experiment-pipelines) on
top of these pipelines: controlling their execution, injecting parameters, etc.

-## Pipeline stages
+## Setup

-Use `dvc stage add` to create _stages_. These represent processes (source code
-tracked with Git) which form the steps of a _pipeline_. Stages also connect code
-to its corresponding data _input_ and _output_. Let's transform a Python script
-into a [stage](/doc/command-reference/stage):
+Working inside an [initialized DVC project](/doc/start#initializing-a-project),
+let's get some sample code for the next steps:
+
+```cli
+$ wget https://code.dvc.org/get-started/code.zip
+$ unzip code.zip && rm -f code.zip
+```
-### ⚙️ Expand to download example code.
+### 💡 Expand to inspect project structure

-Get the sample code like this:
+The project contents should now look like this:

```cli
-$ wget https://code.dvc.org/get-started/code.zip
-$ unzip code.zip
-$ rm -f code.zip
$ tree
.
├── params.yaml
@@ -57,28 +59,47 @@ $ tree
└── train.py
```

-Now let's install the requirements:
+
-> We **strongly** recommend creating a
-> [virtual environment](https://python.readthedocs.io/en/stable/library/venv.html)
-> first.
+The DVC-tracked data needed to run this example can be downloaded using
+`dvc get`:

```cli
-$ pip install -r src/requirements.txt
+$ dvc get https://github.com/iterative/dataset-registry \
+          get-started/data.xml -o data/data.xml
```

-Please also add or commit the source code directory with Git at this point.
+Now, let's go through the usual project setup steps (virtualenv, requirements,
+Git).

-

+First, create and use a
+[virtual environment](https://python.readthedocs.io/en/stable/library/venv.html)
+(it's not a must, but we **strongly** recommend it):

-The data needed to run this example can be found [in a previous page].
+```cli
+$ virtualenv venv && echo "venv" > .gitignore
+$ source venv/bin/activate
+```

-

+Next, install the Python requirements:

-[in a previous page]:
-  /doc/start/data-management/data-versioning#expand-to-get-an-example-dataset
+```cli
+$ pip install -r src/requirements.txt
+```

-

+Finally, this is a good time to commit our code to Git:
+
+```cli
+$ git add .github/ data/ params.yaml src .gitignore
+$ git commit -m "Initial commit"
+```
+
+## Pipeline stages
+
+Use `dvc stage add` to create _stages_. These represent processing steps
+(usually scripts/code tracked with Git) that combine to form the _pipeline_.
+Stages connect code to its corresponding data _input_ and _output_. Let's
+transform a Python script into a [stage](/doc/command-reference/stage):

```cli
$ dvc stage add -n prepare \
@@ -92,24 +113,29 @@ A `dvc.yaml` file is generated. It includes information about the command we
want to run (`python src/prepare.py data/data.xml`), its dependencies, and
outputs.

-DVC uses these metafiles to track the data used and produced by the stage, so
-there's no need to use `dvc add` on `data/prepared`
-[manually](/doc/start/data-management/data-versioning).
+

+DVC uses the pipeline definition to **automatically track** the data used and
+produced by any stage, so there's no need to manually run `dvc add` for
+`data/prepared`!
+
+
-### 💡 Expand to see what happens under the hood.
+### 💡 Expand to get a peek under the hood

-The command options used above mean the following:
+Details on the command options used above:

- `-n prepare` specifies a name for the stage. If you open the `dvc.yaml` file
  you will see a section named `prepare`.

- `-p prepare.seed,prepare.split` defines special types of dependencies —
-  [parameters](/doc/command-reference/params). We'll get to them later in the
+  [parameters](/doc/command-reference/params). Any stage can depend on parameter
+  values from a parameters file (`params.yaml` by default). We'll discuss those
+  more in the
  [Metrics, Parameters, and Plots](/doc/start/data-management/metrics-parameters-plots)
-  page, but the idea is that the stage can depend on field values from a
-  parameters file (`params.yaml` by default):
+  page.

  ```yaml
  prepare:
@@ -118,13 +144,15 @@ prepare:
  ```

- `-d src/prepare.py` and `-d data/data.xml` mean that the stage depends on
-  these files to work. Notice that the source code itself is marked as a
-  dependency. If any of these files change later, DVC will know that this stage
-  needs to be [reproduced](#reproduce).
+  these files (dependencies) to work. Notice that the source code itself is
+  marked as a dependency as well. If any of these files change, DVC will know
+  that this stage needs to be [reproduced](#reproduce) when the pipeline is
+  executed.

- `-o data/prepared` specifies an output directory for this script, which writes
-  two files in it. This is how the workspace should look like after
-  the run:
+  two files in it.
+
+  This is what the workspace looks like after the run:

```git
.
@@ -162,22 +190,17 @@ stages:
-Once you added a stage, you can run the pipeline with `dvc repro`. Next, you can
-use `dvc push` if you wish to save all the data [to remote storage] (usually
-along with `git commit` to version DVC metafiles).
-
-[to remote storage]:
-  /doc/start/data-management/data-versioning#storing-and-sharing
+Once you've added a stage, you can run the pipeline with `dvc repro`.

## Dependency graphs

By using `dvc stage add` multiple times, defining
outputs of a stage as dependencies of another, we can describe a sequence of
-commands which gets to some desired result. This is what we call a [dependency
-graph] and it's what forms a cohesive pipeline.
+dependent commands that leads to a desired result. This is what we call a
+[dependency graph], and it is what forms a cohesive pipeline.

Let's create a second stage chained to the outputs of `prepare`, to perform
feature extraction:

```cli
$ dvc stage add -n featurize \
@@ -187,49 +210,9 @@ $ dvc stage add -n featurize \
    python src/featurization.py data/prepared data/features
```

-The `dvc.yaml` file is updated automatically and should include two stages now.
-
-
- -### 💡 Expand to see what happens under the hood. - -The changes to the `dvc.yaml` should look like this: - -```git - stages: - prepare: - cmd: python src/prepare.py data/data.xml - deps: - - data/data.xml - - src/prepare.py - params: - - prepare.seed - - prepare.split - outs: - - data/prepared -+ featurize: -+ cmd: python src/featurization.py data/prepared data/features -+ deps: -+ - data/prepared -+ - src/featurization.py -+ params: -+ - featurize.max_features -+ - featurize.ngrams -+ outs: -+ - data/features -``` - -Note that you can create and edit `dvc.yaml` files manually instead of using -helper `dvc stage add`. +The `dvc.yaml` file will now be updated to include the two stages. -
- -
-

-### ⚙️ Expand to add more stages.

-Let's add the training itself. Nothing new this time; just the same
-`dvc stage add` command with the same set of options:
+And finally, let's add a 3rd `train` stage:

```cli
$ dvc stage add -n train \
@@ -239,61 +222,36 @@ $ dvc stage add -n train \
    python src/train.py data/features model.pkl
```

-Please check the `dvc.yaml` again, it should have one more stage now.
+Our `dvc.yaml` should now include all 3 stages.
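+
+If you want to double-check what DVC now sees, `dvc stage list` prints the
+stages defined in `dvc.yaml` (a quick sanity check; output not shown here):
+
+```cli
+$ dvc stage list
+```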
- -This should be a good time to commit the changes with Git. These include -`.gitignore`, `dvc.lock`, and `dvc.yaml` — which describe our pipeline. + -## Reproduce - -The whole point of creating this `dvc.yaml` file is the ability to easily -reproduce a pipeline: +This would be a good time to commit the changes with Git. These include +`.gitignore`(s) and `dvc.yaml` — which describes our pipeline. ```cli -$ dvc repro +$ git add .gitignore data/.gitignore dvc.yaml +$ git commit -m "pipeline defined" ``` -
-### ⚙️ Expand to have some fun with it.
-
-Let's try to play a little bit with it. First, let's try to change one of the
-parameters for the training stage:
-
-1. Open `params.yaml` and change `n_est` to `100`, and
-2. (re)run `dvc repro`.
-
-You should see:
+

-```cli
-$ dvc repro
-Stage 'prepare' didn't change, skipping
-Stage 'featurize' didn't change, skipping
-Running stage 'train' with command: ...
-```
+Great! Now we're ready to run the pipeline.

-DVC detected that only `train` should be run, and skipped everything else! All
-the intermediate results are being reused.
+## Reproducing

-Now, let's change it back to `50` and run `dvc repro` again:
+The pipeline definition in `dvc.yaml` allows us to easily reproduce the pipeline:

```cli
$ dvc repro
-Stage 'prepare' didn't change, skipping
-Stage 'featurize' didn't change, skipping
```

-As before, there was no need to rerun `prepare`, `featurize`, etc. But this time
-it also doesn't rerun `train`! The previous run with the same set of inputs
-(parameters & data) was saved in DVC's run cache, and reused here.
-
-
+You'll notice a `dvc.lock` (a "state file") was created to capture the +reproduction's results.
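+
+You can also reproduce just part of the graph by giving `dvc repro` a stage
+name (a quick aside; DVC will still check and run that stage's upstream
+dependencies as needed):
+
+```cli
+$ dvc repro train
+```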
-### 💡 Expand to see what happens under the hood. +### 💡 Expand to get a peek under the hood `dvc repro` relies on the [dependency graph] of stages defined in `dvc.yaml`, and uses `dvc.lock` to determine what exactly needs to be run. @@ -336,26 +294,54 @@ state of the workspace.
-DVC pipelines (`dvc.yaml` file, `dvc stage add`, and `dvc repro` commands) solve -a few important problems: +It's good practice to immediately commit `dvc.lock` to Git after its creation or +modification, to record the current state & results: -- _Automation_: run a sequence of steps in a "smart" way which makes iterating - on your project faster. DVC automatically determines which parts of a project - need to be run, and it caches "runs" and their results to avoid unnecessary - reruns. -- _Reproducibility_: `dvc.yaml` and `dvc.lock` files describe what data to use - and which commands will generate the pipeline results (such as an ML model). - Storing these files in Git makes it easy to version and share. -- [_Continuous Delivery and Continuous Integration (CI/CD) for ML_](/doc/use-cases/ci-cd-for-machine-learning): - describing projects in way that can be reproduced (built) is the first - necessary step before introducing CI/CD systems. See our sister project - [CML](https://cml.dev) for some examples. +```cli +$ git add dvc.lock && git commit -m "first pipeline repro" +``` + +
+ +### ⚙️ Learn how to parametrize and use cached results + +Let's try to have a little bit of fun with it. First, change one of the +parameters for the training stage: + +1. Open `params.yaml` and change `n_est` to `100`, and +2. (re)run `dvc repro`. + +You will see: + +```cli +$ dvc repro +Stage 'prepare' didn't change, skipping +Stage 'featurize' didn't change, skipping +Running stage 'train' with command: ... +``` + +DVC detected that only `train` should be run, and skipped everything else! All +the intermediate results are being reused. + +Now, let's change it back to `50` and run `dvc repro` again: + +```cli +$ dvc repro +Stage 'prepare' didn't change, skipping +Stage 'featurize' didn't change, skipping +``` + +As before, there was no need to rerun `prepare`, `featurize`, etc. But this time +it also doesn't rerun `train`! The previous run with the same set of inputs +(parameters & data) was saved in DVC's run cache, and was reused. -## Visualize +
+
+## Visualizing

Having built our pipeline, we need a good way to understand its structure.
-Seeing a graph of connected stages would help. DVC lets you do so without
-leaving the terminal!
+Visualizing it as a graph of connected stages helps with that. DVC lets you do
+so without leaving the terminal!

```cli
$ dvc dag
@@ -376,5 +362,25 @@ $ dvc dag
          +-------+
```

-> Refer to `dvc dag` to explore other ways this command can visualize a
-> pipeline.
+

+Refer to `dvc dag` to explore other ways this command can visualize a pipeline.

+

+## Summary
+
+DVC pipelines (`dvc.yaml` file, `dvc stage add`, and `dvc repro` commands) solve
+a few important problems:
+
+- _Automation_: run a sequence of steps in a "smart" way which makes iterating
+  on your project faster. DVC automatically determines which parts of a project
+  need to be run, and it caches "runs" and their results to avoid unnecessary
+  reruns.
+- _Reproducibility_: `dvc.yaml` and `dvc.lock` files describe what data to use
+  and which commands will generate the pipeline results (such as an ML model).
+  Storing these files in Git makes it easy to version and share.
+- [_Continuous Delivery and Continuous Integration (CI/CD) for ML_](/doc/use-cases/ci-cd-for-machine-learning):
+  describing projects in a way that can be built and reproduced is the first
+  necessary step before introducing CI/CD systems. See our sister project
+  [CML](https://cml.dev) for some examples.
diff --git a/content/docs/start/data-management/data-versioning.md b/content/docs/start/data-management/data-versioning.md
index a64c788cbf..062fb1be07 100644
--- a/content/docs/start/data-management/data-versioning.md
+++ b/content/docs/start/data-management/data-versioning.md
@@ -15,18 +15,20 @@ https://youtu.be/kLKBcPonMYw

-How cool would it be to make Git handle arbitrarily large files and directories
-with the same performance it has with small code files? Imagine cloning a
-repository and seeing data files and machine learning models in the workspace.
-Or switching to a different version of a 100Gb file in less than a second with a
-`git checkout`. Think "Git for data".
+How cool would it be to track large datasets and machine learning models
+alongside your code, sidestepping all the limitations of storing them in Git?
+Imagine cloning a repository and immediately seeing your datasets, checkpoints,
+and models staged in your workspace. Imagine switching to a different version of
+a 100Gb file in less than a second with a `git checkout`.

-
+**💫 DVC is your _"Git for data"_!** -### ⚙️ Expand to get an example dataset. +## Tracking data -Having initialized a project in the previous section, we can get the data file -(which we'll be using later) like this: +Working inside an [initialized project](/doc/start#initializing-a-project) +directory, let's pick a piece of data to work with. We'll use an example +`data.xml` file, though any text or binary file (or directory) will do. Start by +running: ```cli $ dvc get https://github.com/iterative/dataset-registry \ @@ -35,42 +37,41 @@ $ dvc get https://github.com/iterative/dataset-registry \ -We use the fancy `dvc get` command to jump ahead a bit and show how a Git repo -becomes a source for datasets or models — what we call a [data registry]. -`dvc get` can download any file or directory tracked in a DVC +We used `dvc get` above to show how DVC can turn any Git repo into a "[data +registry]". `dvc get` can download any file or directory tracked in a DVC repository. [data registry]: /doc/use-cases/data-registry -
-

-To start tracking a file or directory, use `dvc add`:
+Use `dvc add` to start tracking the dataset file:

```cli
$ dvc add data/data.xml
```

DVC stores information about the added file in a special `.dvc` file named
-`data/data.xml.dvc` -- a small text file with a human-readable [format]. This
-metadata file is a placeholder for the original data, and can be easily
-versioned like source code with Git:
+`data/data.xml.dvc`. This small, human-readable metadata file acts as a
+placeholder for the original data, so that it can be versioned with Git.
+
+Next, run the following commands to track changes in Git:

```cli
$ git add data/data.xml.dvc data/.gitignore
$ git commit -m "Add raw data"
```

-The data, meanwhile, is listed in `.gitignore`.
+Now the _metadata about your data_ is versioned alongside your source code,
+while the original data file was added to `.gitignore`.

-
+
-### 💡 Click to see what happens under the hood. +### 💡 Expand to get a peek under the hood `dvc add` moved the data to the project's cache, and -linked it back to the workspace. The `.dvc/cache` -should look like this: +linked it back to the workspace. The `.dvc/cache` will +look like this: ``` .dvc/cache @@ -90,28 +91,21 @@ outs:
-[format]: /doc/user-guide/project-structure/dvc-files - ## Storing and sharing -You can upload DVC-tracked data or models with `dvc push`. This requires setting -up [remote storage] first, for example on Amazon S3: - -[remote storage]: /doc/user-guide/data-management/remote-storage - -```cli -$ dvc remote add -d storage s3://mybucket/dvcstore -$ dvc push -``` - -
+You can upload DVC-tracked data to a variety of storage systems (remote or +local) referred to as +["remotes"](/doc/user-guide/data-management/remote-storage). For simplicity, we +will use a "local remote" for this guide, which is just a directory in the local +file system. -### ⚠️ That didn't work! +### Configuring a remote -Instead of the S3 remote in the next block, use this "local remote" (another -directory in the local file system) to try `dvc push`: +Before pushing data to a remote we need to set it up using the `dvc remote add` +command: + ```cli @@ -130,21 +124,42 @@ $ dvc remote add -d myremote %TEMP%\dvcstore - + -DVC supports many remote [storage types], including Amazon S3, SSH, Google +DVC supports many remote [storage types], including Amazon S3, NFS,SSH, Google Drive, Azure Blob Storage, and HDFS. +An example for a common use case is configuring an [Amazon S3] remote: + +```cli +$ dvc remote add -d storage s3://mybucket/dvcstore +``` + +For this to work, you'll need an AWS account and credentials set up to allow +access. + +To learn more about storage remotes, see the [Remote Storage Guide]. + +[Amazon S3]: /doc/user-guide/data-management/remote-storage/amazon-s3 [storage types]: /doc/user-guide/data-management/remote-storage#supported-storage-types +[Remote Storage Guide]: /doc/user-guide/data-management/remote-storage
-
+### Uploading data + +Now that a storage remote was configured, run `dvc push` to upload data: -### 💡 Click to see what happens under the hood. +```cli +$ dvc push +``` + +
+ +#### 💡 Expand to get a peek under the hood `dvc push` copied the data cached locally to the remote storage we set up earlier. The remote storage directory should look like this: @@ -161,21 +176,30 @@ If you prefer to keep human-readable filenames, you can use [cloud versioning].
-Usually, we also want to `git commit` (and `git push`) the project config
-changes.
+Usually, we would also want to commit to Git any code changes that led to the
+data change (`git add`, `git commit`, and `git push`).
+
+### Retrieving data
+
+Once DVC-tracked data and models are stored remotely, they can be downloaded
+with `dvc pull` when needed (e.g. in other copies of this project).
+Usually, we run it after `git pull` or `git clone`.

-## Retrieving
+Let's try this now:

-Having DVC-tracked data and models stored remotely, it can be downloaded with
-`dvc pull` when needed (e.g. in other copies of this project).
-Usually, we run it after `git clone` and `git pull`.
+```cli
+$ dvc pull
+```
-### ⚙️ Expand to delete locally cached data.
+#### Expand to simulate a "fresh pull"

-If you've run `dvc push` successfully, empty the cache and delete
-`data/data.xml` for `dvc pull` to have an effect:
+After running `dvc push` above, the `dvc pull` command was short-circuited by
+DVC for efficiency: the project's `data/data.xml` file, our cache, and the
+remote storage were all already in sync. To see DVC actually move data around,
+we need to empty the cache and delete `data/data.xml` from our project. Let's
+do that now:

@@ -196,29 +220,18 @@

$ del data\data.xml

-
+Now we can run `dvc pull` to retrieve the data from the remote: ```cli $ dvc pull ``` - - -See [Remote Storage] for more information on remote storage. - - - -## Making changes - -When you make a change to a file or directory, run `dvc add` again to track the -latest version: - -
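+
+To double-check that everything is in sync again (a quick sketch; the
+`-c`/`--cloud` flag compares the cache against the default remote):
+
+```cli
+$ dvc status -c
+```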
+
-### ⚙️ Expand to make some changes. +## Making local changes -Let's say we obtained more data from some external source. We can pretend this -is the case by doubling the dataset: +Next, let's say we obtained more data from some external source. We will +simulate this by doubling the dataset contents: @@ -239,13 +252,14 @@ $ type %TEMP%\data.xml >> data\data.xml -
+After modifying the data, run `dvc add` again to track the latest version: ```cli $ dvc add data/data.xml ``` -Usually you would also run `dvc push` and `git commit` to save the changes: +Now we can run `dvc push` to upload the changes to the remote storage, followed +by a `git commit` to track them: ```cli $ dvc push @@ -254,17 +268,16 @@ $ git commit data/data.xml.dvc -m "Dataset updates" ## Switching between versions -The regular workflow is to use `git checkout` first (to switch a branch or -checkout a `.dvc` file version) and then run `dvc checkout` to sync data: +A commonly used workflow is to use `git checkout` to switch to a branch or +checkout a specific `.dvc` file revision, followed by a `dvc checkout` to sync +data into your workspace: ```cli $ git checkout <...> $ dvc checkout ``` -
- -### ⚙️ Expand to get the previous version of the dataset. +## Return to a previous version of the dataset Let's go back to the original version of the data: @@ -280,33 +293,20 @@ of the dataset was already saved): $ git commit data/data.xml.dvc -m "Revert dataset updates" ``` -
-

-Yes, DVC is technically not a version control system! Git itself provides that
-layer. DVC in turn manipulates `.dvc` files, whose contents define the data file
-versions. DVC also synchronizes DVC-tracked data in the workspace
-efficiently to match them.
-
-## Discovering and accessing data
-
-DVC helps you with accessing and using your data artifacts from outside of the
-project where they are versioned, and your tracked data can be imported and
-fetched from anywhere. For example, you may want to download a specific version
-of an ML model to a deployment server or import a dataset into another project.
-To learn about how DVC allows you to do this, see the
-[discovering and accessing data guide](/doc/user-guide/data-management/discovering-and-accessing-data).
+

-## Large datasets versioning
+As you can see, DVC is technically not a version control system by itself! It
+manipulates `.dvc` files, whose contents define the data file versions. Git is
+already used to version your code, and now it can also version your data
+alongside it.

-In cases where you process very large datasets, you need an efficient mechanism
-(in terms of space and performance) to share a lot of data, including different
-versions. Do you use network attached storage (NAS)? Or a large external volume?
-You can learn more about advanced workflows using these links:
+

-- A [shared cache](/doc/user-guide/how-to/share-a-dvc-cache) can be set up to
-  store, version and access a lot of data on a large shared volume efficiently.
-- An advanced scenario is to track and version data directly on the remote
-  storage (e.g. S3, SSH). See [Managing External Data] to learn more.
+### Discovering and accessing data

-[managing external data]:
-  https://dvc.org/doc/user-guide/data-management/managing-external-data
+Your tracked data can be imported and fetched from anywhere using DVC. For
+example, you may want to download a specific version of an ML model to a
+deployment server or import a dataset into another project, like we did at the
+[top of this chapter](/doc/start/data-management/data-versioning?tab=Mac-Linux#tracking-data).
+To learn how DVC allows you to do this, see the
+[Discovering and Accessing Data guide](/doc/user-guide/data-management/discovering-and-accessing-data).
diff --git a/content/docs/start/data-management/metrics-parameters-plots.md b/content/docs/start/data-management/metrics-parameters-plots.md
index 41278d2b82..9975b6a8bc 100644
--- a/content/docs/start/data-management/metrics-parameters-plots.md
+++ b/content/docs/start/data-management/metrics-parameters-plots.md
@@ -46,7 +46,7 @@ $ dvc repro

-### 💡 Expand to see what happens under the hood.
+### 💡 Expand to get a peek under the hood

The `-O` option here specifies an output that will not be cached by DVC, and
`-M` specifies a metrics file (that will also not be cached).

@@ -117,7 +117,7 @@ eval/live/metrics.json 0.94496 0.97723 0.96191 0.987

## Visualizing plots

-The stage also writes different files with data that can be graphed:
+The `evaluate` stage also writes different files with data that can be graphed:

- [DVCLive]-generated [`roc_curve`] and [`confusion_matrix`] values in the
  `eval/live/plots` directory.
@@ -160,9 +160,9 @@ plots:
  - eval/importance.png
```

-To render them, you can run `dvc plots show` (shown below), which generates an
-HTML file you can open in a browser. Or you can load your project in VS Code and
-use the [DVC Extension]'s [Plots Dashboard].
+To render them, run `dvc plots show` (shown below), which generates an HTML file
+you can open in a browser. Or you can load your project in VS Code and use the
+[DVC Extension]'s [Plots Dashboard].

```cli
$ dvc plots show
diff --git a/content/docs/start/index.md b/content/docs/start/index.md
index bde323bc70..7d371e7c1a 100644
--- a/content/docs/start/index.md
+++ b/content/docs/start/index.md
@@ -11,7 +11,8 @@ pipelines and metrics, and manage experiments.'
## Get Started with DVC
-->

-Before we begin, let's prepare a project for this guide
+Before we begin, settle on a directory for this guide. Everything we will do
+will be self-contained there.
@@ -35,8 +36,11 @@ This directory name is used in our
-Assuming DVC is already [installed](/doc/install), initialize it by running
-`dvc init` inside a Git project:
+## Initializing a project
+
+Assuming DVC is already [installed](/doc/install), let's turn the directory you
+chose (our current working directory) into a DVC project by running `dvc init`
+inside a Git project:

```cli
$ dvc init
diff --git a/content/docs/user-guide/data-management/discovering-and-accessing-data.md b/content/docs/user-guide/data-management/discovering-and-accessing-data.md
index 86845415a8..8083eaec7d 100644
--- a/content/docs/user-guide/data-management/discovering-and-accessing-data.md
+++ b/content/docs/user-guide/data-management/discovering-and-accessing-data.md
@@ -81,7 +81,7 @@ bring in changes from the data source later using `dvc update`.
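+For example (a sketch; the `.dvc` file name assumes the `dvc import` example
+above):
+
+```cli
+$ dvc update data.xml.dvc
+```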
-### 💡 Expand to see what happens under the hood.
+### 💡 Expand to get a peek under the hood

diff --git a/content/docs/user-guide/data-management/remote-storage/index.md b/content/docs/user-guide/data-management/remote-storage/index.md
index aa0bbcf255..060aeb67f8 100644
--- a/content/docs/user-guide/data-management/remote-storage/index.md
+++ b/content/docs/user-guide/data-management/remote-storage/index.md
@@ -1,9 +1,10 @@
# Remote Storage

-_DVC remotes_ provide optional/additional storage to back up and share your data
-and ML models. For example, you can download data artifacts created by
-colleagues without spending time and resources to regenerate them locally. See
-also `dvc push` and `dvc pull`.
+_DVC remotes_ provide access to external storage locations to track and share
+your data and ML models. Usually, a remote is shared between devices or team
+members working on the same project. For example, you can download data
+artifacts created by colleagues without spending time and resources to
+regenerate them locally. See also `dvc push` and `dvc pull`.

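+For example, a minimal sketch (the remote name and bucket URL are
+placeholders):
+
+```cli
+$ dvc remote add -d myremote s3://mybucket/path
+$ dvc push
+```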