From 83a3157d6525a23e848fe3aa55e0d529cadd2767 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 8 Nov 2020 20:25:03 -0800 Subject: [PATCH 01/25] guide: reorder how-tos and some copy edits --- content/docs/sidebar.json | 2 +- content/docs/use-cases/versioned-storage.md | 13 ++++++++++ ...-to-stage.md => add-outputs-to-a-stage.md} | 2 +- .../user-guide/how-to/undo-adding-data.md | 26 +++++++------------ .../user-guide/how-to/update-tracked-files.md | 2 +- 5 files changed, 25 insertions(+), 20 deletions(-) create mode 100644 content/docs/use-cases/versioned-storage.md rename content/docs/user-guide/how-to/{add-output-to-stage.md => add-outputs-to-a-stage.md} (98%) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 6794fbff8f..721abcf838 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -100,8 +100,8 @@ "slug": "how-to", "source": false, "children": [ - "add-output-to-stage", "undo-adding-data", + "add-outputs-to-a-stage", "update-tracked-files" ] }, diff --git a/content/docs/use-cases/versioned-storage.md b/content/docs/use-cases/versioned-storage.md new file mode 100644 index 0000000000..3ced05eacf --- /dev/null +++ b/content/docs/use-cases/versioned-storage.md @@ -0,0 +1,13 @@ +# Versioned storage + +What if we could **combine data and ML model versioning features with large file +storage** solutions like traditional hard drives, NAS, or cloud services such as +Amazon S3 and Google Drive? DVC brings together the best of both worlds by +implementing easy synchronization between the data cache and +on-premises or cloud storage for sharing. + +![](/img/model-versioning-diagram.png) _DVC's hybrid versioned storage_ + +> Note that [remote storage](/doc/command-reference/remote) is optional in DVC: +> no server setup or special services are needed, just the `dvc` command-line +> tool. 
diff --git a/content/docs/user-guide/how-to/add-output-to-stage.md b/content/docs/user-guide/how-to/add-outputs-to-a-stage.md similarity index 98% rename from content/docs/user-guide/how-to/add-output-to-stage.md rename to content/docs/user-guide/how-to/add-outputs-to-a-stage.md index 79e3ef292d..7a85bf396a 100644 --- a/content/docs/user-guide/how-to/add-output-to-stage.md +++ b/content/docs/user-guide/how-to/add-outputs-to-a-stage.md @@ -1,4 +1,4 @@ -# Add Output to Stage +# Add Output to a Stage There are situations where we have executed a stage (either by writing `dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later notice diff --git a/content/docs/user-guide/how-to/undo-adding-data.md b/content/docs/user-guide/how-to/undo-adding-data.md index d749f8fc5c..e5a59d747c 100644 --- a/content/docs/user-guide/how-to/undo-adding-data.md +++ b/content/docs/user-guide/how-to/undo-adding-data.md @@ -3,8 +3,8 @@ There are situations where you want to stop tracking data added previously. Follow the steps listed here to undo `dvc add`. -Let's first add a data file into an example project using -`dvc add`, which creates a `.dvc` file to track the data: +Let's first add a data file into an example project, which creates +a `.dvc` file to track the data: ```dvc $ dvc add data.csv @@ -12,32 +12,24 @@ $ ls data.csv data.csv.dvc ``` -> Note, if you are using `symlink` or `hardlink` as -> [link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) -> for DVC cache, you will have to unprotect the tracked file first -> (see `dvc unprotect`): -> -> ```dvc -> $ dvc unprotect data.csv -> ``` +> Note, if you're using `symlink` or `hardlink` as the project's +> [link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache), +> you'll have to unprotect the tracked file first (see `dvc unprotect`). 
-Now let's reverse `dvc add` by removing the corresponding `.dvc` file and -`.gitignore` entry using `dvc remove`: +Now let's reverse that with `dvc remove`. This removes the `.dvc` file (and +corresponding `.gitignore` entry). The data file is now no longer being tracked +after this: ```dvc $ dvc remove data.csv.dvc -``` - -Data file `data.csv` is now no longer being tracked by DVC. -```dvc $ git status Untracked files: data.csv ``` You can run `dvc gc` with the `-w` option to remove the data that isn't -referenced in the current workspace from the cache: +referenced in the current workspace from the cache: ```dvc $ dvc gc -w diff --git a/content/docs/user-guide/how-to/update-tracked-files.md b/content/docs/user-guide/how-to/update-tracked-files.md index e74a1d06b6..554974d263 100644 --- a/content/docs/user-guide/how-to/update-tracked-files.md +++ b/content/docs/user-guide/how-to/update-tracked-files.md @@ -1,4 +1,4 @@ -# Updating Tracked Files +# Update Tracked Files Due to the way DVC handles linking between the data files between the cache and their counterparts in the workspace (refer From 82812e2424882eca3fa5e17fb6ff1d3a5cba1517 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 10 Nov 2020 15:07:55 -0800 Subject: [PATCH 02/25] cases: remove unnecessary file --- content/docs/use-cases/versioned-storage.md | 13 ------------- 1 file changed, 13 deletions(-) delete mode 100644 content/docs/use-cases/versioned-storage.md diff --git a/content/docs/use-cases/versioned-storage.md b/content/docs/use-cases/versioned-storage.md deleted file mode 100644 index 3ced05eacf..0000000000 --- a/content/docs/use-cases/versioned-storage.md +++ /dev/null @@ -1,13 +0,0 @@ -# Versioned storage - -What if we could **combine data and ML model versioning features with large file -storage** solutions like traditional hard drives, NAS, or cloud services such as -Amazon S3 and Google Drive? 
DVC brings together the best of both worlds by -implementing easy synchronization between the data cache and -on-premises or cloud storage for sharing. - -![](/img/model-versioning-diagram.png) _DVC's hybrid versioned storage_ - -> Note that [remote storage](/doc/command-reference/remote) is optional in DVC: -> no server setup or special services are needed, just the `dvc` command-line -> tool. From d82b83a7d68ef450f2c74da938db71f419314b98 Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Wed, 11 Nov 2020 22:38:54 +0530 Subject: [PATCH 03/25] adding dependencies to a stage --- .../how-to/add-outputs-to-a-stage.md | 28 +++++++++++++------ 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/content/docs/user-guide/how-to/add-outputs-to-a-stage.md b/content/docs/user-guide/how-to/add-outputs-to-a-stage.md index 7a85bf396a..432f61f872 100644 --- a/content/docs/user-guide/how-to/add-outputs-to-a-stage.md +++ b/content/docs/user-guide/how-to/add-outputs-to-a-stage.md @@ -2,20 +2,29 @@ There are situations where we have executed a stage (either by writing `dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later notice -that some of the output files or directories it creates, which are already in -the workspace, are missing from `dvc.yaml` (`outs` field). Follow -the steps below to add existing files or directories as outputs to -a stage without re-executing it again, which can be expensive/time-consuming, -and is unnecessary. +that some of the build requirements are missing from `dvc.yaml`: -We start with an example `prepare`, which has a single output. To add a missing -output `data/validate` to this stage, we can edit `dvc.yaml` like this: +- Files or directories in the workspace that are dependencies of + the stage, are missing from `deps` field. + +- Output files or directories that the stage creates, which are already in the + workspace, are missing from `outs` field. 
+ +Follow the steps below to add existing files/directories as +dependencies or outputs to a stage without +re-executing it again, which can be expensive/time-consuming, and is +unnecessary. + +We start with an example `prepare`, which has a single dependency and output. To +add a missing dependency `data.csv`, and output `data/validate` to this stage, +we can edit `dvc.yaml` like this: ```git stages: prepare: cmd: python src/prepare.py deps: ++ - data.csv - src/prepare.py outs: - data/train @@ -23,11 +32,12 @@ output `data/validate` to this stage, we can edit `dvc.yaml` like this: ``` > Note that you can also use `dvc run` with the `-f` and `--no-exec` options to -> add another output to the stage: +> add another dependency/output to the stage: > > ```dvc > $ dvc run -f --no-exec \ > -n prepare \ +> -d data.csv \ > -d src/prepare.py \ > -o data/train \ > -o data/validate \ @@ -38,7 +48,7 @@ output `data/validate` to this stage, we can edit `dvc.yaml` like this: > without executing it. Finally, we need to run `dvc commit` to save the newly specified output(s) to -the cache (and to update the corresponding hash values in +the cache (and to update the hash values of `deps` and `outs` in `dvc.lock`): ```dvc From 17e40d24063f561c5a912711ffd19b6f17fe51bb Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Wed, 11 Nov 2020 23:02:07 +0530 Subject: [PATCH 04/25] How-to title change --- content/docs/sidebar.json | 2 +- ...add-outputs-to-a-stage.md => add-deps-or-outs-to-a-stage.md} | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) rename content/docs/user-guide/how-to/{add-outputs-to-a-stage.md => add-deps-or-outs-to-a-stage.md} (97%) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 721abcf838..efc7899783 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -101,7 +101,7 @@ "source": false, "children": [ "undo-adding-data", - "add-outputs-to-a-stage", + "add-deps-or-outs-to-a-stage", "update-tracked-files" ] }, diff --git 
a/content/docs/user-guide/how-to/add-outputs-to-a-stage.md b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md similarity index 97% rename from content/docs/user-guide/how-to/add-outputs-to-a-stage.md rename to content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md index 432f61f872..2590135872 100644 --- a/content/docs/user-guide/how-to/add-outputs-to-a-stage.md +++ b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md @@ -1,4 +1,4 @@ -# Add Output to a Stage +# Add Dependencies or Outputs to a Stage There are situations where we have executed a stage (either by writing `dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later notice From 20e932cf1a5e586a31810a25f908842175c2d95b Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Wed, 11 Nov 2020 23:03:21 +0530 Subject: [PATCH 05/25] How-to reference update in commit/run --- content/docs/command-reference/commit.md | 8 ++++---- content/docs/command-reference/run.md | 7 ++++--- 2 files changed, 8 insertions(+), 7 deletions(-) diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md index 640c2128f9..135973bc97 100644 --- a/content/docs/command-reference/commit.md +++ b/content/docs/command-reference/commit.md @@ -42,10 +42,10 @@ scenarios are further detailed below. - In cases where we have previously executed a stage (either by writing `dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later - notice that some of the output files or directories it creates, which are - already in the workspace, are missing from `dvc.yaml` (`outs` - field). We can - [add missing outputs to an existing stage](/docs/user-guide/how-to/add-output-to-stage) + notice that some of the existing dependencies, or output files/directories it + creates, which are already in the workspace, are missing from + `dvc.yaml` (`deps` and `outs` field respectively). 
We can + [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) without having to execute it again. Use `dvc commit` to update the `dvc.lock` file and save outputs to the cache. diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 88b1ec9825..4bfc9e1c87 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -107,9 +107,10 @@ Relevant notes: defined as outputs every time its executed by DVC. - In some situations we have executed a stage and later notice that some of the - output files or directories it creates, which are already in the workspace, - are missing from `dvc.yaml` (`outs` field). We can - [add missing outputs to an existing stage](/docs/user-guide/how-to/add-output-to-stage) + existing dependencies, or output files/directories it creates, which are + already in the workspace, are missing from `dvc.yaml` (`deps` and `outs` field + respectively). We can + [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) without having to execute it again. 
- Renaming dependencies or outputs requires a From 77871e6ad2b3fb92620b69a44aa475124d843fea Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 11 Nov 2020 20:43:35 -0600 Subject: [PATCH 06/25] Update content/docs/sidebar.json --- content/docs/sidebar.json | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index efc7899783..2d118b1cbf 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -101,7 +101,10 @@ "source": false, "children": [ "undo-adding-data", - "add-deps-or-outs-to-a-stage", + { + "label": "Add Tracked Data to a Stage", + "slug": "add-deps-or-outs-to-a-stage" + }, "update-tracked-files" ] }, From f578ddd06f32b6b660a5ff5917658d771ca76474 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 11 Nov 2020 21:19:01 -0600 Subject: [PATCH 07/25] Update content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md --- content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md index 2590135872..3065689481 100644 --- a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md +++ b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md @@ -5,8 +5,7 @@ There are situations where we have executed a stage (either by writing that some of the build requirements are missing from `dvc.yaml`: - Files or directories in the workspace that are dependencies of - the stage, are missing from `deps` field. - + the stage are missing from `deps` field. - Output files or directories that the stage creates, which are already in the workspace, are missing from `outs` field. 
From 4abbee5ed704254131093d19f719b036f0eb74e0 Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Fri, 13 Nov 2020 00:35:13 +0530 Subject: [PATCH 08/25] minor updates --- content/docs/command-reference/commit.md | 6 ++--- content/docs/command-reference/run.md | 6 ++--- .../how-to/add-deps-or-outs-to-a-stage.md | 22 +++++++++---------- 3 files changed, 17 insertions(+), 17 deletions(-) diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md index 135973bc97..7ac4b57ec8 100644 --- a/content/docs/command-reference/commit.md +++ b/content/docs/command-reference/commit.md @@ -42,9 +42,9 @@ scenarios are further detailed below. - In cases where we have previously executed a stage (either by writing `dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later - notice that some of the existing dependencies, or output files/directories it - creates, which are already in the workspace, are missing from - `dvc.yaml` (`deps` and `outs` field respectively). We can + notice that some of the existing dependencies of the stage or output + files/directories it creates, which are already in the workspace, + are missing from `dvc.yaml` (`deps` and `outs` field respectively). We can [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) without having to execute it again. Use `dvc commit` to update the `dvc.lock` file and save outputs to the cache. diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 4bfc9e1c87..972646a3d5 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -107,9 +107,9 @@ Relevant notes: defined as outputs every time its executed by DVC. 
- In some situations we have executed a stage and later notice that some of the - existing dependencies, or output files/directories it creates, which are - already in the workspace, are missing from `dvc.yaml` (`deps` and `outs` field - respectively). We can + existing dependencies of the stage or output files/directories it creates, + which are already in the workspace, are missing from `dvc.yaml` (`deps` and + `outs` field respectively). We can [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) without having to execute it again. diff --git a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md index 3065689481..5f3c27dc55 100644 --- a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md +++ b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md @@ -2,28 +2,28 @@ There are situations where we have executed a stage (either by writing `dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later notice -that some of the build requirements are missing from `dvc.yaml`: +that some of its data requirements are missing from `dvc.yaml`. Namely, one of: - Files or directories in the workspace that are dependencies of - the stage are missing from `deps` field. + the stage are missing from `deps`. + - Output files or directories that the stage creates, which are already in the - workspace, are missing from `outs` field. + workspace are missing from `outs`. Follow the steps below to add existing files/directories as -dependencies or outputs to a stage without -re-executing it again, which can be expensive/time-consuming, and is -unnecessary. +dependencies or outputs to a stage without executing +it again, which can be expensive/time-consuming, and is unnecessary. -We start with an example `prepare`, which has a single dependency and output. 
To -add a missing dependency `data.csv`, and output `data/validate` to this stage, -we can edit `dvc.yaml` like this: +We start with an example stage `prepare`, which has a single dependency and +output. To add a missing dependency `data/raw.csv` and output `data/validate` to +this stage, we can edit `dvc.yaml` like this: ```git stages: prepare: cmd: python src/prepare.py deps: -+ - data.csv ++ - data/raw.csv - src/prepare.py outs: - data/train @@ -36,7 +36,7 @@ we can edit `dvc.yaml` like this: > ```dvc > $ dvc run -f --no-exec \ > -n prepare \ -> -d data.csv \ +> -d data/raw.csv \ > -d src/prepare.py \ > -o data/train \ > -o data/validate \ From cec2090fce121e1d05760c0b5831376d5a953930 Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Fri, 13 Nov 2020 15:10:21 +0530 Subject: [PATCH 09/25] minor updates --- content/docs/command-reference/commit.md | 7 ++++--- content/docs/command-reference/run.md | 6 +++--- .../docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md | 7 +++---- 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md index 7ac4b57ec8..6ad4bc5191 100644 --- a/content/docs/command-reference/commit.md +++ b/content/docs/command-reference/commit.md @@ -42,9 +42,10 @@ scenarios are further detailed below. - In cases where we have previously executed a stage (either by writing `dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later - notice that some of the existing dependencies of the stage or output - files/directories it creates, which are already in the workspace, - are missing from `dvc.yaml` (`deps` and `outs` field respectively). We can + notice that some of the existing files/directories used by the stage as + dependencies, or created as outputs, which are already in the + workspace, are missing from `dvc.yaml` (`deps` and `outs` field + respectively). 
We can [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) without having to execute it again. Use `dvc commit` to update the `dvc.lock` file and save outputs to the cache. diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 972646a3d5..060fd5f524 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -107,9 +107,9 @@ Relevant notes: defined as outputs every time its executed by DVC. - In some situations we have executed a stage and later notice that some of the - existing dependencies of the stage or output files/directories it creates, - which are already in the workspace, are missing from `dvc.yaml` (`deps` and - `outs` field respectively). We can + existing files/directories used by the stage as dependencies, or created as + outputs, which are already in the workspace, are missing from `dvc.yaml` + (`deps` and `outs` field respectively). We can [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) without having to execute it again. diff --git a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md index 5f3c27dc55..cf6bba4b85 100644 --- a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md +++ b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md @@ -6,7 +6,6 @@ that some of its data requirements are missing from `dvc.yaml`. Namely, one of: - Files or directories in the workspace that are dependencies of the stage are missing from `deps`. - - Output files or directories that the stage creates, which are already in the workspace are missing from `outs`. @@ -14,9 +13,9 @@ Follow the steps below to add existing files/directories as dependencies or outputs to a stage without executing it again, which can be expensive/time-consuming, and is unnecessary. 
-We start with an example stage `prepare`, which has a single dependency and -output. To add a missing dependency `data/raw.csv` and output `data/validate` to -this stage, we can edit `dvc.yaml` like this: +We start with an example `prepare` stage, which has a single dependency and +output. To add a missing dependency (`data/raw.csv`) as well as a missing output +(`data/validate`) to this stage, we can edit `dvc.yaml` like this: ```git stages: From e005c2d79c7628e1456a26e3f664663e36186a47 Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Wed, 18 Nov 2020 12:23:29 +0530 Subject: [PATCH 10/25] run/commit Updates --- content/docs/command-reference/commit.md | 7 +++---- content/docs/command-reference/run.md | 6 +++--- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md index 6ad4bc5191..5183b83eef 100644 --- a/content/docs/command-reference/commit.md +++ b/content/docs/command-reference/commit.md @@ -42,10 +42,9 @@ scenarios are further detailed below. - In cases where we have previously executed a stage (either by writing `dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later - notice that some of the existing files/directories used by the stage as - dependencies, or created as outputs, which are already in the - workspace, are missing from `dvc.yaml` (`deps` and `outs` field - respectively). We can + notice that some of the files/directories used by the stage as dependencies, + or created as outputs, which are already in the workspace, are + missing from `dvc.yaml` (`deps` and `outs` field respectively). We can [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) without having to execute it again. Use `dvc commit` to update the `dvc.lock` file and save outputs to the cache. 
diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 060fd5f524..8c1871b6d0 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -107,9 +107,9 @@ Relevant notes: defined as outputs every time its executed by DVC. - In some situations we have executed a stage and later notice that some of the - existing files/directories used by the stage as dependencies, or created as - outputs, which are already in the workspace, are missing from `dvc.yaml` - (`deps` and `outs` field respectively). We can + files/directories used by the stage as dependencies, or created as outputs, + which are already in the workspace, are missing from `dvc.yaml` (`deps` and + `outs` field respectively). We can [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) without having to execute it again. From 7f7008525565683f81b658e2f169abf96a3bb8da Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Thu, 19 Nov 2020 22:39:49 +0530 Subject: [PATCH 11/25] removing extra information from run/commit --- content/docs/command-reference/commit.md | 8 +++----- content/docs/command-reference/run.md | 3 +-- content/docs/sidebar.json | 8 ++++---- 3 files changed, 8 insertions(+), 11 deletions(-) diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md index 5183b83eef..9d40959f0b 100644 --- a/content/docs/command-reference/commit.md +++ b/content/docs/command-reference/commit.md @@ -40,11 +40,9 @@ scenarios are further detailed below. reproduce the whole pipeline. If you're sure no pipeline results would change, use `dvc commit` to force update the `dvc.lock` or `.dvc` files and cache. 
-- In cases where we have previously executed a stage (either by writing - `dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later - notice that some of the files/directories used by the stage as dependencies, - or created as outputs, which are already in the workspace, are - missing from `dvc.yaml` (`deps` and `outs` field respectively). We can +- In cases where we have previously executed a stage and later notice that some + of the files/directories used by the stage as dependencies, or created as + outputs, are missing from `dvc.yaml`. We can [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) without having to execute it again. Use `dvc commit` to update the `dvc.lock` file and save outputs to the cache. diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 8c1871b6d0..b18cc7ff64 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -108,8 +108,7 @@ Relevant notes: - In some situations we have executed a stage and later notice that some of the files/directories used by the stage as dependencies, or created as outputs, - which are already in the workspace, are missing from `dvc.yaml` (`deps` and - `outs` field respectively). We can + are missing from `dvc.yaml`. We can [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) without having to execute it again. 
diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 2d118b1cbf..cc422e4512 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -101,10 +101,10 @@ "source": false, "children": [ "undo-adding-data", - { - "label": "Add Tracked Data to a Stage", - "slug": "add-deps-or-outs-to-a-stage" - }, + { + "label": "Add Tracked Data to a Stage", + "slug": "add-deps-or-outs-to-a-stage" + }, "update-tracked-files" ] }, From 2968e79d057fcd79531c54aaea4119c35ce369ed Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 20 Nov 2020 01:58:30 -0600 Subject: [PATCH 12/25] guide: update how-to add deps/outs to stage title per https://github.com/iterative/dvc.org/pull/1914#pullrequestreview-534020229 --- content/docs/sidebar.json | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index cc422e4512..08329df7a5 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -102,7 +102,7 @@ "children": [ "undo-adding-data", { - "label": "Add Tracked Data to a Stage", + "label": "Add Deps or Outs to a Stage", "slug": "add-deps-or-outs-to-a-stage" }, "update-tracked-files" From 28b581b6187881ae43069d1f62e22a32d3e03945 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 20 Nov 2020 02:02:39 -0600 Subject: [PATCH 13/25] guide: remove label since how-to add des/outs slug is the same per https://github.com/iterative/dvc.org/pull/1914#pullrequestreview-534020229 --- content/docs/sidebar.json | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 08329df7a5..efc7899783 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -101,10 +101,7 @@ "source": false, "children": [ "undo-adding-data", - { - "label": "Add Deps or Outs to a Stage", - "slug": "add-deps-or-outs-to-a-stage" - }, + "add-deps-or-outs-to-a-stage", "update-tracked-files" ] }, From 
964571ec311424861c01eabd3a4a3790c612026c Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Fri, 20 Nov 2020 17:09:38 +0530 Subject: [PATCH 14/25] Updates to run and commit --- content/docs/command-reference/commit.md | 11 +++++------ content/docs/command-reference/run.md | 6 +++--- 2 files changed, 8 insertions(+), 9 deletions(-) diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md index 9d40959f0b..c4f62b97d7 100644 --- a/content/docs/command-reference/commit.md +++ b/content/docs/command-reference/commit.md @@ -40,12 +40,11 @@ scenarios are further detailed below. reproduce the whole pipeline. If you're sure no pipeline results would change, use `dvc commit` to force update the `dvc.lock` or `.dvc` files and cache. -- In cases where we have previously executed a stage and later notice that some - of the files/directories used by the stage as dependencies, or created as - outputs, are missing from `dvc.yaml`. We can - [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) - without having to execute it again. Use `dvc commit` to update the `dvc.lock` - file and save outputs to the cache. +- In some cases, we have previously executed a stage, and later notice that some + of the files/directories used by the stage as dependencies or created as + outputs are missing from `dvc.yaml`. It is possible to + [add missing data to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage). + Use `dvc commit` to update the `dvc.lock` file and save outputs to the cache. 
- It's always possible to manually execute the command or source code used in a stage without DVC (outputs must be unprotected or removed first in certain diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index b18cc7ff64..6c97e5033f 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -106,9 +106,9 @@ Relevant notes: also means that the stage command needs to recreate any directory structures defined as outputs every time its executed by DVC. -- In some situations we have executed a stage and later notice that some of the - files/directories used by the stage as dependencies, or created as outputs, - are missing from `dvc.yaml`. We can +- In some situations, we have previously executed a stage, and later notice that + some of the files/directories are missing from the `deps` or `outs` fields of + `dvc.yaml`. We can [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) without having to execute it again. From 78df111df22fe1d31660062f9becae17f28215f1 Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Sat, 21 Nov 2020 19:59:13 +0530 Subject: [PATCH 15/25] How-to simplification --- content/docs/command-reference/commit.md | 5 +++-- content/docs/command-reference/run.md | 4 ++-- .../how-to/add-deps-or-outs-to-a-stage.md | 22 +++++++++---------- 3 files changed, 15 insertions(+), 16 deletions(-) diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md index c4f62b97d7..12951b2ac7 100644 --- a/content/docs/command-reference/commit.md +++ b/content/docs/command-reference/commit.md @@ -43,8 +43,9 @@ scenarios are further detailed below. - In some cases, we have previously executed a stage, and later notice that some of the files/directories used by the stage as dependencies or created as outputs are missing from `dvc.yaml`. 
It is possible to - [add missing data to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage). - Use `dvc commit` to update the `dvc.lock` file and save outputs to the cache. + [add missing data to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage), + and then `dvc commit` can be used to save outputs to the cache (and update + `dvc.lock`) - It's always possible to manually execute the command or source code used in a stage without DVC (outputs must be unprotected or removed first in certain diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 6c97e5033f..3fc6a88af7 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -107,8 +107,8 @@ Relevant notes: defined as outputs every time its executed by DVC. - In some situations, we have previously executed a stage, and later notice that - some of the files/directories are missing from the `deps` or `outs` fields of - `dvc.yaml`. We can + some of the files/directories used by the stage as dependencies, or created as + outputs are missing from `dvc.yaml`. It is possible to [add missing dependencies/outputs to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage) without having to execute it again. diff --git a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md index cf6bba4b85..88eb94125b 100644 --- a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md +++ b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md @@ -1,17 +1,15 @@ # Add Dependencies or Outputs to a Stage -There are situations where we have executed a stage (either by writing -`dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later notice -that some of its data requirements are missing from `dvc.yaml`. 
Namely, one of: - -- Files or directories in the workspace that are dependencies of - the stage are missing from `deps`. -- Output files or directories that the stage creates, which are already in the - workspace are missing from `outs`. - -Follow the steps below to add existing files/directories as -dependencies or outputs to a stage without executing -it again, which can be expensive/time-consuming, and is unnecessary. +There are situations where we have executed a stage, but later notice that some +of the files/directories used by the stage as dependencies or created as outputs +are missing from `dvc.yaml`. + +To add existing files/directories as dependencies or +outputs to a stage, either edit the `dvc.yaml` file or use +`dvc run` with the `--no-exec` option. Then use `dvc commit` to save the +output(s) to the cache (and update `dvc.lock`). + +## Example We start with an example `prepare` stage, which has a single dependency and output. To add a missing dependency (`data/raw.csv`) as well as a missing output From 41d86dbef55a95b39be0c144089510a6f5bbeae7 Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Mon, 23 Nov 2020 03:17:52 +0530 Subject: [PATCH 16/25] How-to description update --- .../how-to/add-deps-or-outs-to-a-stage.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md index 88eb94125b..81d0150677 100644 --- a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md +++ b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md @@ -1,13 +1,14 @@ # Add Dependencies or Outputs to a Stage -There are situations where we have executed a stage, but later notice that some -of the files/directories used by the stage as dependencies or created as outputs -are missing from `dvc.yaml`. 
+To add files/directories as dependencies or outputs to +a stage without executing it, either edit the `dvc.yaml` file or use `dvc run` +with the `--no-exec` option. -To add existing files/directories as dependencies or -outputs to a stage, either edit the `dvc.yaml` file or use -`dvc run` with the `--no-exec` option. Then use `dvc commit` to save the -output(s) to the cache (and update `dvc.lock`). +There are situations where we have executed a stage, but later notice that some +of the existing files/directories used by the stage as dependencies or created +as outputs are missing from `dvc.yaml`. We can add the existing +files/directories to a stage, and then use `dvc commit` to save the output(s) to +the cache (and update `dvc.lock`). ## Example @@ -44,8 +45,7 @@ output. To add a missing dependency (`data/raw.csv`) as well as a missing output > without executing it. Finally, we need to run `dvc commit` to save the newly specified output(s) to -the cache (and to update the hash values of `deps` and `outs` in -`dvc.lock`): +the cache (and to update the hash values of `deps` and `outs` in `dvc.lock`): ```dvc $ dvc commit From dabb188a872ad6b6bc69c65599bf0641c3efac7d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 24 Nov 2020 19:20:15 -0600 Subject: [PATCH 17/25] how-to: generalize adding deps/outs to existing stages per https://github.com/iterative/dvc.org/pull/1914#pullrequestreview-536102514 --- content/docs/sidebar.json | 9 ++++-- .../how-to/add-deps-or-outs-to-a-stage.md | 28 +++++++++++-------- 2 files changed, 22 insertions(+), 15 deletions(-) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index efc7899783..fecf6ce8be 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -100,9 +100,12 @@ "slug": "how-to", "source": false, "children": [ - "undo-adding-data", - "add-deps-or-outs-to-a-stage", - "update-tracked-files" + { + "label": "Un-track Data", + "slug": "untrack-data" + }, + "update-tracked-files", +
"add-deps-or-outs-to-a-stage" ] }, "setup-google-drive-remote", diff --git a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md index 81d0150677..c57a912318 100644 --- a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md +++ b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md @@ -1,20 +1,23 @@ # Add Dependencies or Outputs to a Stage To add files/directories as dependencies or outputs to -a stage without executing it, either edit the `dvc.yaml` file or use `dvc run` -with the `--no-exec` option. +a stage without executing it (which can be expensive/time-consuming, and is +unnecessary) you can either edit the `dvc.yaml` file directly, or use `dvc run` +with the `-f` and `--no-exec` options to the same end. -There are situations where we have executed a stage, but later notice that some -of the existing files/directories used by the stage as dependencies or created -as outputs are missing from `dvc.yaml`. We can add the existing -files/directories to a stage, and then use `dvc commit` to save the output(s) to -the cache (and update `dvc.lock`). +After updating `dvc.yaml`, use `dvc commit` to save any output file(s) that +already exist in the workspace to the cache (and +to update `dvc.lock`). + +> This could be a need for example after executing a stage, but later noticing +> that some of the files/directories it uses as dependencies or creates as +> outputs are missing from `dvc.yaml`. ## Example We start with an example `prepare` stage, which has a single dependency and output. To add a missing dependency (`data/raw.csv`) as well as a missing output -(`data/validate`) to this stage, we can edit `dvc.yaml` like this: +(`data/validate`), we can edit `dvc.yaml` like this: ```git stages: @@ -28,8 +31,8 @@ output. 
To add a missing dependency (`data/raw.csv`) as well as a missing output + - data/validate ``` -> Note that you can also use `dvc run` with the `-f` and `--no-exec` options to -> add another dependency/output to the stage: +> We could also use `dvc run` with `-f` and `--no-exec` to add another +> dependency/output to the stage: > > ```dvc > $ dvc run -f --no-exec \ @@ -44,8 +47,9 @@ output. To add a missing dependency (`data/raw.csv`) as well as a missing output > `-f` overwrites the stage in `dvc.yaml`, while `--no-exec` updates the stage > without executing it. -Finally, we need to run `dvc commit` to save the newly specified output(s) to -the cache (and to update the hash values of `deps` and `outs` in `dvc.lock`): +If the `data/raw.csv` or `data/validate` files exist, we also need to use +`dvc commit` to save the newly specified deps and outs to the cache +(and to update the hash values of `deps` and `outs` in `dvc.lock`): ```dvc $ dvc commit From f77e100b1e5f6b8dd838a0bcd2e9b1894ed1ff48 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 24 Nov 2020 19:28:48 -0600 Subject: [PATCH 18/25] how-to: rearrange and rename them, copy edits and simplifications --- .../how-to/add-deps-or-outs-to-a-stage.md | 19 +++-- .../user-guide/how-to/undo-adding-data.md | 36 --------- .../docs/user-guide/how-to/untrack-data.md | 45 +++++++++++ .../user-guide/how-to/update-tracked-files.md | 79 +++++++++---------- 4 files changed, 91 insertions(+), 88 deletions(-) delete mode 100644 content/docs/user-guide/how-to/undo-adding-data.md create mode 100644 content/docs/user-guide/how-to/untrack-data.md diff --git a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md index c57a912318..10ef80f549 100644 --- a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md +++ b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md @@ -1,17 +1,16 @@ -# Add Dependencies or Outputs to a Stage +# How to Add
Dependencies or Outputs -To add files/directories as dependencies or outputs to -a stage without executing it (which can be expensive/time-consuming, and is +We have executed a stage, but later notice that some of the files/directories it +uses as dependencies or creates as outputs are missing from `dvc.yaml`... + +To add files or directories as dependencies or outputs +to an existing stage without re-executing it (which can be expensive and is unnecessary) you can either edit the `dvc.yaml` file directly, or use `dvc run` with the `-f` and `--no-exec` options to the same end. -After updating `dvc.yaml`, use `dvc commit` to save any output file(s) that -already exist in the workspace to the cache (and -to update `dvc.lock`). - -> This could be a need for example after executing a stage, but later noticing -> that some of the files/directories it uses as dependencies or creates as -> outputs are missing from `dvc.yaml`. +After the update, use `dvc commit` to save any output files that already exist +in the workspace to the cache (and to update +`dvc.lock`). ## Example diff --git a/content/docs/user-guide/how-to/undo-adding-data.md b/content/docs/user-guide/how-to/undo-adding-data.md deleted file mode 100644 index e5a59d747c..0000000000 --- a/content/docs/user-guide/how-to/undo-adding-data.md +++ /dev/null @@ -1,36 +0,0 @@ -# Undo Adding Data - -There are situations where you want to stop tracking data added previously. -Follow the steps listed here to undo `dvc add`. - -Let's first add a data file into an example project, which creates -a `.dvc` file to track the data: - -```dvc -$ dvc add data.csv -$ ls -data.csv data.csv.dvc -``` - -> Note, if you're using `symlink` or `hardlink` as the project's -> [link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache), -> you'll have to unprotect the tracked file first (see `dvc unprotect`). - -Now let's reverse that with `dvc remove`. 
This removes the `.dvc` file (and -corresponding `.gitignore` entry). The data file is now no longer being tracked -after this: - -```dvc -$ dvc remove data.csv.dvc - -$ git status - Untracked files: - data.csv -``` - -You can run `dvc gc` with the `-w` option to remove the data that isn't -referenced in the current workspace from the cache: - -```dvc -$ dvc gc -w -``` diff --git a/content/docs/user-guide/how-to/untrack-data.md b/content/docs/user-guide/how-to/untrack-data.md new file mode 100644 index 0000000000..d90505c1fe --- /dev/null +++ b/content/docs/user-guide/how-to/untrack-data.md @@ -0,0 +1,45 @@ +# How to Un-track Data + +There are situations where you may want to stop tracking data added previously. +Let's see how it can be done using an example `data.csv` file. + +
+ +## Click to add the sample data first + +Let's `dvc add` a `data.csv` file into an example project, which +creates a `.dvc` file to track the data and adds it to `.gitignore`: + +```dvc +$ dvc add data.csv + +$ ls +data.csv data.csv.dvc +$ cat .gitignore +/data.csv +``` + +
+ +> Note that if you are using `symlink` or `hardlink` as +> [link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache), +> you will have to `dvc unprotect` the tracked file first. + +Let's undo `dvc add` with `dvc remove`. This removes the `.dvc` file (and +corresponding `.gitignore` entry). The data file is now no longer being tracked +after this: + +```dvc +$ dvc remove data.csv.dvc + +$ git status + Untracked files: + data.csv +``` + +You can run `dvc gc` with the `-w` option to remove the data that isn't +referenced in the current workspace from the cache: + +```dvc +$ dvc gc -w +``` diff --git a/content/docs/user-guide/how-to/update-tracked-files.md b/content/docs/user-guide/how-to/update-tracked-files.md index 554974d263..6befe7b85f 100644 --- a/content/docs/user-guide/how-to/update-tracked-files.md +++ b/content/docs/user-guide/how-to/update-tracked-files.md @@ -1,76 +1,71 @@ -# Update Tracked Files +# How to Update Tracked Files -Due to the way DVC handles linking between the data files between the -cache and their counterparts in the workspace (refer -to [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization)), -updating tracked files has to be carried out with caution to avoid data -corruption when the DVC config option `cache.type` is set to `hardlink` or/and -`symlink`. (See `dvc config cache` for more details on setting the cache file -link types.) +Updating a tracked data file (or directory) may mean either +[modifying](#modifying-content) some of its contents, or completely +[replacing](#replacing-file) it with a new one (same file name). -> For an example of the cache corruption problem see -> [issue #599](https://github.com/iterative/dvc/issues/599) in our GitHub -> repository. +When the `cache.type` config option is set to `symlink` or `hardlink` (not the +default, see `dvc config cache` for more info.), updating tracked files has to +be carried out with caution, to avoid data corruption. 
This is due to the way in +which DVC handles linking data files between the cache and the +workspace (refer to +[Large Dataset Optimization](/doc/user-guide/large-dataset-optimization) for +details). -Assume `train.tsv` is tracked by DVC and you want to update it. Here updating -may mean either replacing `train.tsv` with a new file having the same name or -editing the content of the file. +> For an example of the cache corruption problem see +> [issue #599](https://github.com/iterative/dvc/issues/599) in our GitHub repo. -If you run `dvc repro` there is no need to manage generated (output) files -manually. DVC removes them for you before executing the stage that generates +If you use `dvc.yaml` files and `dvc repro`, there is no need to manage stage +outputs manually. DVC removes them for you before regenerating them. -If you use DVC to track a file that is generated during your pipeline (e.g. some -intermediate result or a final model file i.e. `model.pkl`) and you don't use -`dvc run` and `dvc repro` to manage your pipeline, use the procedure below (run -`dvc unprotect` or `dvc remove`) to unlink it from DVC cache prior to the -execution of the script that modifies it. - -See also `dvc unprotect` and `dvc config cache` to learn more about protecting -your data files. +Otherwise (the data was tracked with `dvc add`), use one of the procedures below +to unlink the data from the cache prior to updating it. We'll be working with a +`train.tsv` file: -## Replacing file - -If you want to replace the file, you can take the following steps. +## Modifying content -First, un-track the file. This will remove `train.tsv` from the workspace: +"Unlink" the file with `dvc unprotect`. 
This will make `train.tsv` safe to edit: ```dvc -$ dvc remove train.tsv.dvc +$ dvc unprotect train.tsv ``` -Next, replace the file with new content: +Then edit the content of the file: ```dvc -$ echo new > train.tsv +$ echo "new data item" >> train.tsv ``` -And start tracking it again: +Add the new version of the file back with DVC: ```dvc $ dvc add train.tsv -$ git add train.tsv.dvc .gitignore -$ git commit -m "new train data" +$ git add train.tsv.dvc +$ git commit -m "modify train data" ``` -## Modifying content +## Replacing files -"Unlink" the file with `dvc unprotect`. This will make `train.tsv` safe to edit: +If you want to replace the file, you can take the following steps. + +First, [un-track](/doc/user-guide/how-to/untrack-data) the file with +`dvc remove`. This will remove `train.tsv` from the workspace: ```dvc -$ dvc unprotect train.tsv +$ dvc remove train.tsv.dvc ``` -Edit the content of the file: +Next, replace the file with new content: ```dvc -$ echo "new data item" >> train.tsv +$ echo new > train.tsv ``` -Add the new version of the file back with DVC: +And start tracking it again: ```dvc $ dvc add train.tsv -$ git add train.tsv.dvc -$ git commit -m "modify train data" +$ git add train.tsv.dvc .gitignore +$ git commit -m "new train data" ``` From 5bb6255d89d976cea4a4006d75bf7e381eba5e2c Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 24 Nov 2020 23:27:25 -0600 Subject: [PATCH 19/25] how-to: Un-track -> Stop Tracking per https://github.com/iterative/dvc.org/pull/1914#pullrequestreview-538092146 --- content/docs/sidebar.json | 5 +---- .../how-to/{untrack-data.md => stop-tracking-data.md} | 2 +- content/docs/user-guide/how-to/update-tracked-files.md | 2 +- 3 files changed, 3 insertions(+), 6 deletions(-) rename content/docs/user-guide/how-to/{untrack-data.md => stop-tracking-data.md} (97%) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index fecf6ce8be..74475695d1 100644 --- a/content/docs/sidebar.json +++ 
b/content/docs/sidebar.json @@ -100,10 +100,7 @@ "slug": "how-to", "source": false, "children": [ - { - "label": "Un-track Data", - "slug": "untrack-data" - }, + "stop-tracking-data", "update-tracked-files", "add-deps-or-outs-to-a-stage" ] diff --git a/content/docs/user-guide/how-to/untrack-data.md b/content/docs/user-guide/how-to/stop-tracking-data.md similarity index 97% rename from content/docs/user-guide/how-to/untrack-data.md rename to content/docs/user-guide/how-to/stop-tracking-data.md index d90505c1fe..5cc7bdc516 100644 --- a/content/docs/user-guide/how-to/untrack-data.md +++ b/content/docs/user-guide/how-to/stop-tracking-data.md @@ -1,4 +1,4 @@ -# How to Un-track Data +# How to Stop Tracking Data There are situations where you may want to stop tracking data added previously. Let's see how it can be done using an example `data.csv` file. diff --git a/content/docs/user-guide/how-to/update-tracked-files.md b/content/docs/user-guide/how-to/update-tracked-files.md index 6befe7b85f..ac6c634b1c 100644 --- a/content/docs/user-guide/how-to/update-tracked-files.md +++ b/content/docs/user-guide/how-to/update-tracked-files.md @@ -49,7 +49,7 @@ $ git commit -m "modify train data" If you want to replace the file, you can take the following steps. -First, [un-track](/doc/user-guide/how-to/untrack-data) the file with +First, [stop tracking](/doc/user-guide/how-to/stop-tracking-data) the file with `dvc remove`. 
This will remove `train.tsv` from the workspace: ```dvc From 2fbcc9bb6f3d582282dfae6fe7aa970e3b14b03d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 24 Nov 2020 23:37:44 -0600 Subject: [PATCH 20/25] how-to: remove already unprotects + clarify about gc per https://github.com/iterative/dvc.org/pull/1914#pullrequestreview-538156666 --- .../docs/user-guide/how-to/stop-tracking-data.md | 8 ++------ .../user-guide/how-to/update-tracked-files.md | 15 ++++++++------- 2 files changed, 10 insertions(+), 13 deletions(-) diff --git a/content/docs/user-guide/how-to/stop-tracking-data.md b/content/docs/user-guide/how-to/stop-tracking-data.md index 5cc7bdc516..4003e99e07 100644 --- a/content/docs/user-guide/how-to/stop-tracking-data.md +++ b/content/docs/user-guide/how-to/stop-tracking-data.md @@ -21,10 +21,6 @@ $ cat .gitignore -> Note that if you are using `symlink` or `hardlink` as -> [link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache), -> you will have to `dvc unprotect` the tracked file first. - Let's undo `dvc add` with `dvc remove`. This removes the `.dvc` file (and corresponding `.gitignore` entry). The data file is now no longer being tracked after this: @@ -37,8 +33,8 @@ $ git status data.csv ``` -You can run `dvc gc` with the `-w` option to remove the data that isn't -referenced in the current workspace from the cache: +You can run `dvc gc` with the `-w` option to remove the data (and all of it's +previous versions, if any) from the cache: ```dvc $ dvc gc -w diff --git a/content/docs/user-guide/how-to/update-tracked-files.md b/content/docs/user-guide/how-to/update-tracked-files.md index ac6c634b1c..60bcbd33bd 100644 --- a/content/docs/user-guide/how-to/update-tracked-files.md +++ b/content/docs/user-guide/how-to/update-tracked-files.md @@ -20,18 +20,18 @@ If you use `dvc.yaml` files and `dvc repro`, there is no need to manage stage them. 
Otherwise (the data was tracked with `dvc add`), use one of the procedures below -to unlink the data from the cache prior to updating it. We'll be working with a -`train.tsv` file: +to "unlink" the data from the cache prior to updating it. We'll be working with +a `train.tsv` file: ## Modifying content -"Unlink" the file with `dvc unprotect`. This will make `train.tsv` safe to edit: +Unlink the file with `dvc unprotect`. This will make `train.tsv` safe to edit: ```dvc $ dvc unprotect train.tsv ``` -Then edit the content of the file: +Then edit the content of the file, for example with: ```dvc $ echo "new data item" >> train.tsv @@ -47,10 +47,11 @@ $ git commit -m "modify train data" ## Replacing files -If you want to replace the file, you can take the following steps. +If you want to replace the file altogether, you can take the following steps. -First, [stop tracking](/doc/user-guide/how-to/stop-tracking-data) the file with -`dvc remove`. This will remove `train.tsv` from the workspace: +First, [stop tracking](/doc/user-guide/how-to/stop-tracking-data) the file by +using `dvc remove` on the `.dvc` file. 
This will remove `train.tsv` from the +workspace (and unlink it from the cache): ```dvc $ dvc remove train.tsv.dvc From b741d7c6251d760c5d40f2703aa8e94b39650fd2 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 25 Nov 2020 00:22:56 -0600 Subject: [PATCH 21/25] how-to: clarify conditional note about existing out files (added to stage) per https://github.com/iterative/dvc.org/pull/1914#pullrequestreview-538154904 --- .../docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md index 10ef80f549..1333d2d784 100644 --- a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md +++ b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md @@ -8,9 +8,9 @@ to an existing stage without re-executing it (which can be expensive and is unnecessary) you can either edit the `dvc.yaml` file directly, or use `dvc run` with the `-f` and `--no-exec` options to the same end. -After the update, use `dvc commit` to save any output files that already exist -in the workspace to the cache (and to update -`dvc.lock`). +If some output files already exist in the workspace, you can use +`dvc commit` after the update, to save them to the cache (and to +update `dvc.lock`). ## Example From 590789e34cbca73a97e362e52d5cf9c2dcb4bd9d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 25 Nov 2020 16:59:40 -0600 Subject: [PATCH 22/25] how-to: simplify add deps/outs to stages and add SEO fields perhttps://github.com/iterative/dvc.org/pull/1914#pullrequestreview-538204006 etc. 
--- .../how-to/add-deps-or-outs-to-a-stage.md | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md index 1333d2d784..8546a7644f 100644 --- a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md +++ b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md @@ -1,16 +1,21 @@ +--- +title: 'How to Add Dependencies or Outputs to a Stage ' +description: 'We have executed a stage, but later notice that some of the +dependencies or outputs are missing...' +--- + # How to Add Dependencies or Outputs We have executed a stage, but later notice that some of the files/directories it uses as dependencies or creates as outputs are missing from `dvc.yaml`... -To add files or directories as dependencies or outputs -to an existing stage without re-executing it (which can be expensive and is -unnecessary) you can either edit the `dvc.yaml` file directly, or use `dvc run` -with the `-f` and `--no-exec` options to the same end. +To add dependencies or outputs to an existing stage +without re-executing it (which can be expensive and is unnecessary), edit the +`dvc.yaml` file (by hand or using `dvc run` with the `-f --no-exec` options). If some output files already exist in the workspace, you can use -`dvc commit` after the update, to save them to the cache (and to -update `dvc.lock`). +`dvc commit` after the update to save them to the cache and update +`dvc.lock`. 
## Example From b6a384dd0f5fc8c69a68eed1c2dfc7738050dd97 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 27 Nov 2020 00:25:39 -0600 Subject: [PATCH 23/25] how-to: rename Stop tracking data -> Reverse mistakes --- content/docs/sidebar.json | 2 +- ...king-data.md => reverse-common-mistakes.md} | 18 +++++++++++------- 2 files changed, 12 insertions(+), 8 deletions(-) rename content/docs/user-guide/how-to/{stop-tracking-data.md => reverse-common-mistakes.md} (52%) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 74475695d1..0621c2a434 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -100,7 +100,7 @@ "slug": "how-to", "source": false, "children": [ - "stop-tracking-data", + "reverse-common-mistakes", "update-tracked-files", "add-deps-or-outs-to-a-stage" ] diff --git a/content/docs/user-guide/how-to/stop-tracking-data.md b/content/docs/user-guide/how-to/reverse-common-mistakes.md similarity index 52% rename from content/docs/user-guide/how-to/stop-tracking-data.md rename to content/docs/user-guide/how-to/reverse-common-mistakes.md index 4003e99e07..97b21d65fe 100644 --- a/content/docs/user-guide/how-to/stop-tracking-data.md +++ b/content/docs/user-guide/how-to/reverse-common-mistakes.md @@ -1,14 +1,15 @@ -# How to Stop Tracking Data +# How to Reverse Common Mistakes -There are situations where you may want to stop tracking data added previously. -Let's see how it can be done using an example `data.csv` file. +## Stop tracking data + +There are situations where you may want to "un-track" files or directories added +in error to DVC.
-## Click to add the sample data first +## Expand to add a sample data `data.csv` file -Let's `dvc add` a `data.csv` file into an example project, which -creates a `.dvc` file to track the data and adds it to `.gitignore`: +`dvc add` creates a `.dvc` file to track the file, and lists it in `.gitignore`: ```dvc $ dvc add data.csv @@ -21,7 +22,7 @@ $ cat .gitignore
-Let's undo `dvc add` with `dvc remove`. This removes the `.dvc` file (and +Let's undo `dvc add` with `dvc remove`. This deletes the `.dvc` file (and corresponding `.gitignore` entry). The data file is now no longer being tracked after this: @@ -39,3 +40,6 @@ previous versions, if any) from the cache: ```dvc $ dvc gc -w ``` + +> Note that a very similar procedure works for pipeline stages and their +> outputs. From 9ebe6d02680c828834219d0bf1a231c52a2d13af Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 27 Nov 2020 00:36:35 -0600 Subject: [PATCH 24/25] cases: generalize add deps/outs to stages per https://github.com/iterative/dvc.org/pull/1914#pullrequestreview-538205452 --- .../how-to/add-deps-or-outs-to-a-stage.md | 21 +++++++++---------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md index 8546a7644f..1b8cebdf9b 100644 --- a/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md +++ b/content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md @@ -6,16 +6,15 @@ dependencies or outputs are missing...' # How to Add Dependencies or Outputs -We have executed a stage, but later notice that some of the files/directories it -uses as dependencies or creates as outputs are missing from `dvc.yaml`... +To add dependencies or outputs to a stage, edit the +`dvc.yaml` file (by hand or using `dvc run` with the `-f --no-exec` flags). +`dvc repro` will execute it and cache the output files when ready. -To add dependencies or outputs to an existing stage -without re-executing it (which can be expensive and is unnecessary), edit the -`dvc.yaml` file (by hand or using `dvc run` with the `-f --no-exec` options). +If the stage has already been executed it and the desired outputs are present in +the workspace, you can avoid `dvc repro` (which can be expensive +and is unnecessary) and use `dvc commit` instead. 
-If some output files already exist in the workspace, you can use -`dvc commit` after the update to save them to the cache and update -`dvc.lock`. +> Both alternatives update `dvc.lock` accordingly. ## Example @@ -51,9 +50,9 @@ output. To add a missing dependency (`data/raw.csv`) as well as a missing output > `-f` overwrites the stage in `dvc.yaml`, while `--no-exec` updates the stage > without executing it. -If the `data/raw.csv` or `data/validate` files exist, we also need to use -`dvc commit` to save the newly specified deps and outs to the cache -(and to update the hash values of `deps` and `outs` in `dvc.lock`): +If the `data/raw.csv` or `data/validate` files already exist, we can use +`dvc commit` to cache the newly specified outputs (and to update the `deps` and +`outs` file hashes in `dvc.lock`): ```dvc $ dvc commit From c1d3ad9942209568420b9634e2dcfe3f3ca5c91c Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 27 Nov 2020 16:24:23 -0600 Subject: [PATCH 25/25] how-to: rename Common Mistakes -> Stop Tracking Data again per https://github.com/iterative/dvc.org/pull/1914#pullrequestreview-540149404 --- content/docs/sidebar.json | 2 +- .../how-to/reverse-common-mistakes.md | 45 ------------------- 2 files changed, 1 insertion(+), 46 deletions(-) delete mode 100644 content/docs/user-guide/how-to/reverse-common-mistakes.md diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 16baceb24c..f56dcd1933 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -100,7 +100,7 @@ "slug": "how-to", "source": false, "children": [ - "reverse-common-mistakes", + "stop-tracking-data", "update-tracked-files", "add-deps-or-outs-to-a-stage" ] diff --git a/content/docs/user-guide/how-to/reverse-common-mistakes.md b/content/docs/user-guide/how-to/reverse-common-mistakes.md deleted file mode 100644 index 97b21d65fe..0000000000 --- a/content/docs/user-guide/how-to/reverse-common-mistakes.md +++ /dev/null @@ -1,45 +0,0 @@ -# How to Reverse 
Common Mistakes - -## Stop tracking data - -There are situations where you may want to "un-track" files or directories added -in error to DVC. - -
- -## Expand to add a sample data `data.csv` file - -`dvc add` creates a `.dvc` file to track the file, and lists it in `.gitignore`: - -```dvc -$ dvc add data.csv - -$ ls -data.csv data.csv.dvc -$ cat .gitignore -/data.csv -``` - -
- -Let's undo `dvc add` with `dvc remove`. This deletes the `.dvc` file (and -corresponding `.gitignore` entry). The data file is now no longer being tracked -after this: - -```dvc -$ dvc remove data.csv.dvc - -$ git status - Untracked files: - data.csv -``` - -You can run `dvc gc` with the `-w` option to remove the data (and all of it's -previous versions, if any) from the cache: - -```dvc -$ dvc gc -w -``` - -> Note that a very similar procedure works for pipeline stages and their -> outputs.