From bd0e2504e7103b51312577f2b947518f5b57ffde Mon Sep 17 00:00:00 2001 From: Ruslan Kuprieiev Date: Thu, 13 Feb 2020 09:07:51 +0200 Subject: [PATCH] add docs for dvc metrics diff (#933) * add docs for dvc metrics diff * nav: add `metrics diff` to sidebar * cmd ref: typos in `metrics diff` * cmd ref: rewrite `metrics diff` ref and review related concepts throughout docs e.g. "Git reference", "working tree" * cmd ref: update descs, review options, link all metrics subcmds addresses https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348849914 as well as https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348847997 and https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348858027 * cmd ref: update cmd argument descriptions for `diff` and `metrics diff` * metrics diff: big terminology review around the intro of this new command per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348855380 et al. * term: review usage of "hash", "commit hash", "SHA", and "MD5" per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348856746 * term: rewrite definition of "workspace" per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348857535 * cmd ref: change link from `metrics diff` options to `metrics show` per https://github.com/iterative/dvc.org/pull/933#issuecomment-580033273 * cmd ref: update example in `dvc metrics diff` and similar ones per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348851587 * cmd ref: simplify dvc gc -a option per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350540249 and https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350540611 * cmd ref: use "reference" more than "revision" in diff per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350539579 * cmd ref: link term "revision" in diff and `metrics diff` also per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350539579 * term: put Git ref 
examples before term and link per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350540611 * cmd ref: friendlier explanation of "tip of default branch" per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350540726 * cmd ref: use tag name instead of term "the revision" per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350541159 * term: revert some "revision"->"reference" changes, and related simplifications per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350541159 * cmd ref: review desc. of `-a` options throughout refs * cmd ref: update diff params per iterative/dvc/pull/3244 * cmd ref: update notes around moving/static Git refs in import and update per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350601759 * revert workspace glossary entry per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350604641 * tutorial: use full name of Deep Dive Tutorial in title and links per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350605629 * user-guide: undo change on "binary" literal for analytics example per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350607861 * use-cases: avoid term "revision" in data-registries per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350608591 * term: revert "hash"->"checksum" in this PR per https://github.com/iterative/dvc.org/pull/933#discussion_r373221327 * cmd ref: "revision"->"commit" in get ref per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350609358 * cmd ref: use correct tag names in checkout examples and double check they still work per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350597257 * diff: remove backquotes around "HEAD" same as in core repo per https://github.com/iterative/dvc/pull/3244#pullrequestreview-351263562 * cmd ref: don't use link to git reference doc per https://github.com/iterative/dvc.org/pull/933#discussion_r373753614 
* cmd ref: don't use term "revision" in diff, prefer "commit" per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-351862100 * cmd ref: no need for word "specific" (or "SHA") in get/import per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-351862174 * cmd ref: update "project"->"workspace" term and example intros in `dvc install` per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-351862271 * docs: 2 misc updates per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-351862315 and https://github.com/iterative/dvc.org/pull/933#pullrequestreview-351862377 * tutorials: update model->"data or model" per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-351862439 et al. * cmd ref: fixed link to `metrics diff` and updated mention of it in `metrics show` per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-351862509 and https://github.com/iterative/dvc.org/pull/933#pullrequestreview-351862516 and https://github.com/iterative/dvc.org/pull/933#pullrequestreview-351862579 * get-started: typo in pipelines chapter * cmd ref: rewrite paragraph about fixed revision import stages in `update` per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-351862953 * cmd ref: rewrite p about `repro` rewriting artifacts in cache per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-351863154 * cmd ref: rephrase and split p about what to compare and about targets in `status` per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-353489318 * cmd ref: reorg last part of repro desc * cmd ref: restore `git tag` sample output in checkout examples per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350597382 and https://github.com/iterative/dvc.org/issues/961#issuecomment-580397025 * cmd ref: small rewording around tag names * cmd ref: `metrics diff` and `diff` intro updates per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-356890960 
and https://github.com/iterative/dvc.org/pull/933#pullrequestreview-356892062 * term: use "version" instead of "revision" in `import` cmd ref per https://github.com/iterative/dvc.org/pull/933#discussion_r377815981 * cmd ref: updates to `metrics diff` (and `diff`) descriptions per https://dvc.org/doc/command-reference/status and https://github.com/iterative/dvc.org/pull/933#pullrequestreview-356893342 and https://github.com/iterative/dvc.org/pull/933#pullrequestreview-356894688 * cmd ref: change "ref" -> "rev" per iterative/dvc/pull/3299 and https://github.com/iterative/dvc.org/pull/933#pullrequestreview-356892140 * cmd ref: "revision"->"version" in a couple more docs per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-356909461 * cmd ref: simplify note about `metrics diff` in `metrics show` per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-357004725 * cmd ref: use descriptive example-get-started repo tags in `get` examples per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-356887194 * term: "commit hash"->"commit SHA hash" to match #962 but may change the decision in that other PR. 
* cmd ref: improve -a and -c option descs per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-356884181 * cmd ref: remove p about --targets option in metrics diff per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-356894688 * cmd ref: rewrite p about fixed revisions/re-importing in `update` per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-357220534 * tutorials: use and instead of data "or" models in versioning tut per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-357222526 and https://github.com/iterative/dvc.org/pull/933#pullrequestreview-357222170 * user-guide: restore bullet about `git` in analytics per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-357223391 * you don't usually merge tags per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-351862377 * term: don't use "version of repo/project" when referring to commits per https://github.com/iterative/dvc.org/pull/933#discussion_r378604638 * cmd ref: simplify note about metrics diff in metrics show per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-357004725 * cmd ref: and->and/or in checkout sample of versioning tut per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-357222170 * user-guide: updated analytics details per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-357223391 * cmd ref: restore simpler wording about status -aT desc per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-357224140 * cmd ref: correct (again) the short desc for diff and metrics diff per https://github.com/iterative/dvc.org/pull/933#discussion_r378654316 Co-authored-by: Jorge Orpinel --- public/static/docs/changelog/0.18.md | 4 +- public/static/docs/command-reference/add.md | 8 +- .../static/docs/command-reference/checkout.md | 78 +++++------ .../static/docs/command-reference/config.md | 6 +- public/static/docs/command-reference/diff.md | 81 ++++++------ 
public/static/docs/command-reference/fetch.md | 17 +-- public/static/docs/command-reference/gc.md | 25 ++-- public/static/docs/command-reference/get.md | 52 ++++---- .../docs/command-reference/import-url.md | 10 +- .../static/docs/command-reference/import.md | 20 +-- .../static/docs/command-reference/install.md | 63 ++++----- .../docs/command-reference/metrics/add.md | 24 ++-- .../docs/command-reference/metrics/diff.md | 122 ++++++++++++++++++ .../docs/command-reference/metrics/index.md | 2 + .../docs/command-reference/metrics/modify.md | 20 +-- .../docs/command-reference/metrics/show.md | 61 +++++---- public/static/docs/command-reference/pull.md | 11 +- public/static/docs/command-reference/push.md | 8 +- .../docs/command-reference/remote/modify.md | 2 +- public/static/docs/command-reference/repro.md | 21 +-- public/static/docs/command-reference/run.md | 2 +- .../static/docs/command-reference/status.md | 54 ++++---- .../static/docs/command-reference/update.md | 13 +- .../static/docs/command-reference/version.md | 12 +- public/static/docs/get-started/add-files.md | 6 +- public/static/docs/get-started/agenda.md | 2 +- public/static/docs/get-started/experiments.md | 2 +- public/static/docs/get-started/import-data.md | 2 +- .../static/docs/get-started/older-versions.md | 2 +- public/static/docs/get-started/pipeline.md | 4 +- public/static/docs/get-started/store-data.md | 7 +- public/static/docs/install/pre-release.md | 6 +- public/static/docs/sidebar.json | 4 + .../docs/tutorials/deep/define-ml-pipeline.md | 2 +- public/static/docs/tutorials/deep/index.md | 2 +- .../docs/tutorials/deep/reproducibility.md | 4 +- public/static/docs/tutorials/pipelines.md | 9 +- public/static/docs/tutorials/versioning.md | 34 ++--- .../understanding-dvc/collaboration-issues.md | 6 +- .../understanding-dvc/related-technologies.md | 8 +- .../docs/understanding-dvc/what-is-dvc.md | 15 ++- .../static/docs/use-cases/data-registries.md | 10 +- .../versioning-data-and-model-files.md | 15 ++- 
public/static/docs/user-guide/analytics.md | 20 +-- .../docs/user-guide/contributing/core.md | 2 +- .../static/docs/user-guide/dvc-file-format.md | 15 +-- .../docs/user-guide/external-dependencies.md | 2 +- .../docs/user-guide/managing-external-data.md | 14 +- .../docs/user-guide/running-dvc-on-windows.md | 9 +- .../docs/user-guide/updating-tracked-files.md | 2 +- 50 files changed, 512 insertions(+), 408 deletions(-) create mode 100644 public/static/docs/command-reference/metrics/diff.md diff --git a/public/static/docs/changelog/0.18.md b/public/static/docs/changelog/0.18.md index 984691e3a9..92021df8a4 100644 --- a/public/static/docs/changelog/0.18.md +++ b/public/static/docs/changelog/0.18.md @@ -28,8 +28,8 @@ really excited to share the progress with you: - 🙂 **Usability improvements** - DVC interface got more informative and easier to use: - - More heavy operations render dynamic progress bar (e.g. hash computation): - ![](/static/img/0.18-progress.gif) + - More heavy operations render dynamic progress bar (e.g. file hash + computation): ![](/static/img/0.18-progress.gif) - Pipeline visualization via command line. Just run `dvc pipeline show` with `ascii` option and a target: ![](/static/img/0.18-pipeline.gif) diff --git a/public/static/docs/command-reference/add.md b/public/static/docs/command-reference/add.md index c9dc2af363..cb3c3b8b0d 100644 --- a/public/static/docs/command-reference/add.md +++ b/public/static/docs/command-reference/add.md @@ -74,8 +74,8 @@ to work with directory hierarchies with `dvc add`: directory (with default name `dirname.dvc`). Every file in the hierarchy is added to the cache (unless `--no-commit` flag is added), but DVC does not produce individual DVC-files for each file in the directory tree. Instead, - the single DVC-file points to a file in the cache that contains references to - the files in the added hierarchy. + the single DVC-file references a file in the cache that in turn points to the + files in the added hierarchy. 
In a DVC project, `dvc add` can be used to version control any data artifact (input, intermediate, or output files and @@ -197,8 +197,8 @@ Saving information to 'pics.dvc'. There are no [DVC-files](/doc/user-guide/dvc-file-format) generated within this directory structure, but the images are all added to the cache. DVC -prints a message about this, mentioning that `md5` values are computed for each -directory. A single `pics.dvc` DVC-file is generated for the top-level +prints a message about this, mentioning that MD5 hash values are computed for +each directory. A single `pics.dvc` DVC-file is generated for the top-level directory, and it contains: ```yaml diff --git a/public/static/docs/command-reference/checkout.md b/public/static/docs/command-reference/checkout.md index 67f3c8a37b..e5082c87f2 100644 --- a/public/static/docs/command-reference/checkout.md +++ b/public/static/docs/command-reference/checkout.md @@ -33,17 +33,17 @@ The execution of `dvc checkout` does the following: - Scans the DVC-files to compare against the data files or directories in the workspace. DVC knows which data (outputs) match - because their checksums are saved in the `outs` fields inside the DVC-files. - Scanning is limited to the given `targets` (if any). See also options - `--with-deps` and `--recursive` below. + because the corresponding file hash values are saved in the `outs` fields in + the DVC-files. Scanning is limited to the given `targets` (if any). See also + options `--with-deps` and `--recursive` below. - Missing data files or directories, or those that don't match with any DVC-file, are restored from the cache. See options `--force` and `--relink`. -By default, this command tries not to copy files between the cache and the -workspace, using reflinks instead, when supported by the file system. (Refer to -[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache).) 
+By default, this command tries not to make copies of cached files in the workspace, +using reflinks instead when supported by the file system (refer to +[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)). The next linking strategy default value is `copy` though, so unless other file link types are manually configured in `cache.type` (using `dvc config`), files will be copied. Keep in mind that having file copies doesn't present much of a @@ -102,10 +102,9 @@ be pulled from remote storage using `dvc pull`. ## Examples Let's employ a simple workspace with some data, code, ML models, -pipeline stages, as well as a few Git tags, such as our -[get started example repo](https://github.com/iterative/example-get-started). -Then we can see what happens with `git checkout` and `dvc checkout` as we switch -from tag to tag. +pipeline stages, such as the DVC project created in our +[Get Started](/doc/get-started) section. Then we can see what happens with +`git checkout` and `dvc checkout` as we switch from tag to tag.
@@ -120,8 +119,7 @@ $ cd example-get-started
-The workspace looks almost like in this -[pipeline setup](/doc/tutorials/pipelines): +The workspace looks something like this: ```dvc . @@ -131,17 +129,19 @@ The workspace looks almost like in this ├── featurize.dvc ├── prepare.dvc ├── train.dvc -└── src - └── +├── src +│ └── ... +└── train.dvc ``` -We have these tags in the repository that represent different iterations of -solving the problem: +This repository includes the following tags, that represent different variants +of the resulting model: ```dvc $ git tag -baseline-experiment <- first simple version of the model -bigrams-experiment <- use bigrams to improve the model +... +baseline-experiment <- First simple version of the model +bigrams-experiment <- Uses bigrams to improve the model ``` This project comes with a predefined HTTP @@ -153,57 +153,52 @@ files that are under DVC control. The model file checksum ```dvc $ dvc pull -... -Checking out model.pkl with cache '3863d0e317dee0a55c4e59d2ec0eef33' -... $ md5 model.pkl -MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33 +MD5 (model.pkl) = 662eb7f64216d9c2c1088d0a5e2c6951 ``` -What if we want to rewind history, so to speak? The `git checkout` command lets -us checkout at any point in the commit history, or even checkout other tags. It +What if we want to "rewind history", so to speak? The `git checkout` command +lets us restore any point in the repository history, including any tags. It automatically adjusts the files, by replacing file content and adding or deleting files as necessary. ```dvc -$ git checkout baseline -Note: checking out 'baseline'. -... -HEAD is now at 40cc182... 
+$ git checkout baseline-experiment # Stage where model is first created ``` Let's check the `model.pkl` entry in `train.dvc` now: ```yaml outs: - md5: a66489653d1b6a8ba989799367b32c43 - path: model.pkl + - md5: 43630cce66a2432dcecddc9dd006d0a7 + path: model.pkl ``` But if you check `model.pkl`, the file hash is still the same: ```dvc $ md5 model.pkl -MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33 +MD5 (model.pkl) = 662eb7f64216d9c2c1088d0a5e2c6951 ``` This is because `git checkout` changed `featurize.dvc`, `train.dvc`, and other DVC-files. But it did nothing with the `model.pkl` and `matrix.pkl` files. Git -doesn't track those files, DVC does, so we must do this: +doesn't track those files; DVC does, so we must do this: ```dvc $ dvc fetch $ dvc checkout + $ md5 model.pkl -MD5 (model.pkl) = a66489653d1b6a8ba989799367b32c43 +MD5 (model.pkl) = 43630cce66a2432dcecddc9dd006d0a7 ``` -What happened is that DVC went through the sole existing DVC-file and adjusted -the current set of files to match the `outs` of that stage. `dvc fetch` is run -once to download missing data from the remote storage to the cache. -Alternatively, we could have just run `dvc pull` in this case to automatically -do `dvc fetch` + `dvc checkout`. +What happened is that DVC went through the DVC-files and adjusted the current +set of files to match the `outs` in them. `dvc fetch` is run this once to +download missing data from the remote storage to the cache. +(Alternatively, we could have just run `dvc pull` to do `dvc fetch` + +`dvc checkout` in one step.) ## Example: Automating DVC checkout @@ -223,13 +218,10 @@ running `dvc checkout` when needed. again: ```dvc -$ git checkout bigrams -Previous HEAD position was d171a12 add evaluation stage -HEAD is now at d092b42 try using bigrams -Checking out model.pkl with cache '3863d0e317dee0a55c4e59d2ec0eef33'. 
+$ git checkout bigrams-experiment # Has the latest model version $ md5 model.pkl -MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33 +MD5 (model.pkl) = 662eb7f64216d9c2c1088d0a5e2c6951 ``` Previously this took two commands, `git checkout` followed by `dvc checkout`. We diff --git a/public/static/docs/command-reference/config.md b/public/static/docs/command-reference/config.md index 2481693ebe..554529a754 100644 --- a/public/static/docs/command-reference/config.md +++ b/public/static/docs/command-reference/config.md @@ -115,9 +115,9 @@ for more details.) This section contains the following options: Due to the way DVC handles linking between the data files in the cache and their counterparts in the workspace, it's easy to accidentally - corrupt the cached version of a file by editing or overwriting it. Turning - this config option on forces you to run `dvc unprotect` before updating a - file, providing an additional layer of security to your data. + corrupt the cached file by editing or overwriting it. Turning this config + option on forces you to run `dvc unprotect` before updating a file, providing + an additional layer of security to your data. We highly recommend enabling this option when `cache.type` is set to `hardlink` or `symlink`. diff --git a/public/static/docs/command-reference/diff.md b/public/static/docs/command-reference/diff.md index b1a1665895..c4d750dde3 100644 --- a/public/static/docs/command-reference/diff.md +++ b/public/static/docs/command-reference/diff.md @@ -1,10 +1,8 @@ # diff -Show differences between two versions of the DVC project. It can be -narrowed down to specific target files and directories under DVC control. - -> This command requires that the project is a [Git](https://git-scm.com/) -> repository. +Show changes between commits in the DVC repository, or between a +commit and the workspace. The comparison can be narrowed down to +specific target files/directories tracked by DVC. 
## Synopsis @@ -12,38 +10,32 @@ narrowed down to specific target files and directories under DVC control. usage: dvc diff [-h] [-q | -v] [-t TARGET] a_ref [b_ref] positional arguments: - a_ref Git reference from which diff calculates - b_ref Git reference until which diff calculates, if omitted diff - shows the difference between current HEAD and a_ref + a_rev Old Git commit to compare (defaults to HEAD) + b_rev New Git commit to compare (defaults to the + current workspace) ``` ## Description -Given two Git commit references (commit hash, branch or tag name, etc) `a_ref` -and `b_ref`, this command shows a comparative summary of basic statistics: how -many files were deleted/changed, and the file size differences. - -> Note that `dvc diff` does not show the line-to-line comparison among the -> target files in each revision, like `git diff` or -> [GNU `diff`](https://www.gnu.org/software/diffutils/) can. This is because the -> data data tracked by DVC can come in many possible formats e.g. structured -> text, or binary blobs, etc. +Given two commit SHA hashes, branch or tag names, etc. +([references](https://git-scm.com/docs/revisions)) `a_ref` and `b_ref`, this +command shows a comparative summary of basic statistics: how many files were +deleted/changed, and the file size differences. -> For an example on how to create line-to-line text file comparison, refer to -> [issue #770](https://github.com/iterative/dvc/issues/770#issuecomment-512693256) -> in our GitHub repository. +> Note that `dvc diff` does not show the line-to-line comparisons like +> `git diff` or [GNU `diff`](https://www.gnu.org/software/diffutils/) can. This +> is because the data tracked by DVC comes in many formats such as +> structured text, binary blobs, etc. For an example on how to create +> line-to-line text file comparison, refer to +> [issue #770](https://github.com/iterative/dvc/issues/770#issuecomment-512693256). 
-If the `-t` option is used, the diff is limited to the `TARGET` file or -directory specified. - -Note that `dvc diff` does not have an effect when the repository is not tracked -by the Git SCM, for example when `dvc init` was used with the `--no-scm` option. +`dvc diff` does not have an effect when the repository is not tracked by Git, +for example when `dvc init` was used with the `--no-scm` option. ## Options -- `-t TARGET`, `--target TARGET` - path to a data file or directory. If not - specified, compares all files and directories that are under DVC control in - the workspace. +- `-t TARGET`, `--target TARGET` - path to a data file or directory to limit + diff for. - `-h`, `--help` - prints the usage/help message, and exit. @@ -64,8 +56,9 @@ For these examples we can use the chapters in our Start by cloning our example repo if you don't already have it. Then move into the repo and checkout the -[version](https://github.com/iterative/example-get-started/releases/tag/3-add-file) -corresponding to the _Add Files_ chapter: +[3-add-file](https://github.com/iterative/example-get-started/releases/tag/3-add-file) +tag, corresponding to the [Add Files](/doc/get-started/add-files) _Get Started_ +chapter: ```dvc $ git clone https://github.com/iterative/example-get-started @@ -83,13 +76,14 @@ Preparing to download data from 'https://remote.dvc.org/get-started' -## Example: Previous version of the same branch +## Example: Previous commit in the same branch + +The minimal `dvc diff` command only includes the "from" reference (`a_ref`) from +which to calculate the difference. The "until" reference (`b_ref`) defaults to +`HEAD` (current Git commit). -The minimal `dvc diff` command only includes the A reference (`a_ref`) from -which the difference is to be calculated. The B reference (`b_ref`) defaults to -Git `HEAD` (the currently checked out version). 
To find the general differences -with the very previous committed version of the project, we can use the `HEAD^` -Git reference. +To see the difference with the very previous commit of the project, we can use +`HEAD^` as `a_ref`: ```dvc $ dvc diff HEAD^ @@ -101,7 +95,7 @@ diff for 'data/data.xml' added file with size 37.9 MB ``` -## Example: Specific targets across Git references +## Example: Specific targets across Git commits We can base this example in the [Metrics](/doc/get-started/metrics) and [Compare Experiments](/doc/get-started/compare-experiments) chapters of our _Get @@ -131,8 +125,8 @@ example repo. -To see the difference in `model.pkl` among these versions, we can run the -following command. +To see the difference in `model.pkl` among these tags, we can run the following +command. ```dvc $ dvc diff -t model.pkl baseline-experiment bigrams-experiment @@ -145,7 +139,8 @@ diff for 'model.pkl' ``` The output from this command confirms that there's a difference in the -`model.pkl` file between the 2 Git references we indicated. +`model.pkl` file between the 2 Git commits (tags `baseline-experiment` and +`bigrams-experiment`) we indicated. ### What about directories? @@ -193,6 +188,6 @@ diff for 'data/prepared' ``` The command above checks whether there have been any changes to the -`data/prepared` directory after the `5-preparation` version (since the `b_ref` -is the current version, `HEAD` by default). The output tells us that there have -been no changes to that directory (or to any other file). +`data/prepared` directory after the `5-preparation` tag (since the `b_ref` is +`HEAD` by default). The output tells us that there have been no changes to that +directory (or to any other file). 
diff --git a/public/static/docs/command-reference/fetch.md b/public/static/docs/command-reference/fetch.md index a0ac224a07..6c157c9205 100644 --- a/public/static/docs/command-reference/fetch.md +++ b/public/static/docs/command-reference/fetch.md @@ -94,10 +94,11 @@ specified in DVC-files currently in the project are considered by `dvc fetch` fetched. The default value is `4 * cpu_count()`. For SSH remotes default is just 4. -- `-a`, `--all-branches` - fetch cache for all Git branches, not just the active - one. This means DVC may download files needed to reproduce different versions - of a DVC-file ([experiments](/doc/get-started/experiments)), not just the - current one. +- `-a`, `--all-branches` - fetch cache for all Git branches instead of just the + current workspace. This means DVC may download files needed to reproduce + different versions of a DVC-file + ([experiments](/doc/get-started/experiments)), not just the ones currently in + the workspace. - `-T`, `--all-tags` - fetch cache for all Git tags. Similar to `-a` above. Note that both options can be combined, for example using the `-aT` flag. @@ -115,9 +116,9 @@ specified in DVC-files currently in the project are considered by `dvc fetch` ## Examples Let's employ a simple workspace with some data, code, ML models, -pipeline stages, as well as a few Git tags, such as our -[get started example repo](https://github.com/iterative/example-get-started). -Then we can see what happens with `dvc fetch` as we switch from tag to tag. +pipeline stages, such as the DVC project created in our +[Get Started](/doc/get-started) section. Then we can see what happens with +`dvc fetch` as we switch from tag to tag.
@@ -154,7 +155,7 @@ solving the problem: $ git tag baseline-experiment <- first simple version of the model -bigrams-experiment <- use bigrams to improve the model +bigrams-experiment <- use bigrams to improve the model ``` ## Example: Default behavior diff --git a/public/static/docs/command-reference/gc.md b/public/static/docs/command-reference/gc.md index 4ba6cf11ab..0509f80dcf 100644 --- a/public/static/docs/command-reference/gc.md +++ b/public/static/docs/command-reference/gc.md @@ -24,7 +24,7 @@ There are important things to note when using Git to version the - If the cache/remote holds several versions of the same data, all except the current one will be deleted. - Use the `--all-branches` or `--all-tags` options to avoid collecting data - referenced in the tips of all branches or in all tags, respectively. + referenced in the tips of all branches or all tags, respectively. Unless the `--cloud` (`-c`) option is used, `dvc gc` does not remove data files from any remote. This means that any files collected from the local cache can be @@ -33,15 +33,14 @@ restored using `dvc fetch`, as long as they have previously been uploaded with ## Options -- `-a`, `--all-branches` - keep cached objects referenced from the latest commit - across all Git branches. It should be used if you want to keep data for the - latest experiment revisions. Especially, if you intend to use `dvc gc -c` this - option is much safer. +- `-a`, `--all-branches` - keep cached objects referenced in all Git branches. + Useful for keeping data for all the latest experiment versions. It's + recommended to consider including this option when using `-c` i.e. + `dvc gc -ac`. -- `-T`, `--all-tags` - the same as `-a` above but keeps cache for existing Git - tags. It's useful if tags are used to track "checkpoints" of an experiment or - project. Note that both options can be combined, for example using the `-aT` - flag. +- `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags. 
It's + useful if tags are used to track "checkpoints" of an experiment or project. + Note that both options can be combined, for example using the `-aT` flag. - `-p`, `--projects` - if a single remote or a single cache is shared among different projects (e.g. a configuration like the one described @@ -49,10 +48,10 @@ restored using `dvc fetch`, as long as they have previously been uploaded with specify a list of them (each project is a path) to keep data that is currently referenced from them. -- `-c`, `--cloud` - also remove files in the default remote storage. _This - operation is dangerous._ It removes datasets, models, other files that are not - linked in the current branch/commit (unless `-a` or `-T` is specified). Use - `-r` to specify which remote to collect from (instead of the default). +- `-c`, `--cloud` - also remove files in remote storage. _This operation is + dangerous._ It removes datasets, models, other files that are not linked in + the current commit (unless `-a` or `-T` are also used). The default remote is + used unless a specific one is given with `-r`. - `-r`, `--remote` - name of the remote storage to collect unused objects from if `-c` option is specified. diff --git a/public/static/docs/command-reference/get.md b/public/static/docs/command-reference/get.md index 78a541a81c..7844c3eff7 100644 --- a/public/static/docs/command-reference/get.md +++ b/public/static/docs/command-reference/get.md @@ -56,11 +56,10 @@ name. an existing directory is specified, then the output will be placed inside of it. -- `--rev` - specific - [Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) - (such as a branch name, a tag, or a commit hash) of the repository to download - the file or directory from. The tip of the default branch is used by default - when this option is not specified. +- `--rev` - commit SHA hash, branch or tag name, etc. 
(any + [Git revision](https://git-scm.com/docs/revisions)) of the repository to + download the file or directory from. The latest commit in `master` (tip of the + default branch) is used by default when this option is not specified. - `--show-url` - instead of downloading the file or directory, just print the storage location (URL) of the target data. `path` is expected to represent a @@ -120,7 +119,7 @@ $ ls install.sh ``` -### Example: Getting the storage URL of a DVC-tracked file +## Example: Getting the storage URL of a DVC-tracked file We can use `dvc get --show-url` to get the actual location where the final model file from our @@ -128,7 +127,8 @@ file from our stored: ```dvc -$ dvc get https://github.com/iterative/example-get-started model.pkl --show-url +$ dvc get --show-url \ + https://github.com/iterative/example-get-started model.pkl https://remote.dvc.org/get-started/66/2eb7f64216d9c2c1088d0a5e2c6951 ``` @@ -138,11 +138,12 @@ https://remote.dvc.org/get-started/66/2eb7f64216d9c2c1088d0a5e2c6951 ## Example: Compare different versions of data or model -`dvc get` provides the `--rev` option to specify which version of the repository -to download a data artifact from. It also has the `--out` option to -specify the location to place the artifact within the workspace. Combining these -two options allows us to do something we can't achieve with the regular -`git checkout` + `dvc checkout` process – see for example the +`dvc get` provides the `--rev` option to specify which +[commit](https://git-scm.com/docs/revisions) of the repository to download a +data artifact from. It also has the `--out` option to specify the +location to place the artifact within the workspace. Combining these two options +allows us to do something we can't achieve with the regular `git checkout` + +`dvc checkout` process – see for example the [Get Older Data Version](/doc/get-started/older-versions) chapter of our _Get Started_. 
@@ -156,16 +157,16 @@ $ git clone git@github.com:iterative/example-get-started.git $ cd example-get-started ``` -If you are familiar with our [Get Started](/doc/get-started) example, you may -know that each chapter has a corresponding -[tag](https://github.com/iterative/example-get-started/tags). Tag `7-train` is -where we train a first version of the example model, and tag `9-bigrams-model` -has an improved model (trained using bigrams). What if we wanted to have both -versions of the model "checked out" at the same time? `dvc get` provides an easy -way to do this: +If you are familiar with our [Get Started](/doc/get-started) project (used in +these examples), you may remember that the chapter where we train a first +version of the model corresponds to the `baseline-experiment` tag in the +repo. Similarly, `bigrams-experiment` points to an improved model (trained using +bigrams). What if we wanted to have both versions of the model "checked out" at +the same time? `dvc get` provides an easy way to do this: ```dvc -$ dvc get . model.pkl --rev 7-train --out model.monograms.pkl +$ dvc get . model.pkl --rev baseline-experiment \ + --out model.monograms.pkl ``` > Notice that the `url` provided to `dvc get` above is `.`. `dvc get` accepts @@ -173,12 +174,11 @@ $ dvc get . model.pkl --rev 7-train --out model.monograms.pkl The `model.monograms.pkl` file now contains the older version of the model. To get the most recent one, we use a similar command, but with - -`-o model.bigrams.pkl` and `--rev 9-bigrams-model` or even without `--rev` -(since it's the latest version anyway). In fact, in this case using `dvc pull` -with the corresponding [DVC-files](/doc/user-guide/dvc-file-format) should -suffice, downloading the file as just `model.pkl`. We can then rename it to make -its version explicit: +`-o model.bigrams.pkl` and `--rev bigrams-experiment` (or even without `--rev` +since that tag has the latest model version anyway). 
In fact, in this case using +`dvc pull` with the corresponding [DVC-files](/doc/user-guide/dvc-file-format) +should suffice, downloading the file as just `model.pkl`. We can then rename it +to make its variant explicit: ```dvc $ dvc pull train.dvc diff --git a/public/static/docs/command-reference/import-url.md b/public/static/docs/command-reference/import-url.md index e5e7ca8207..d59b65e86a 100644 --- a/public/static/docs/command-reference/import-url.md +++ b/public/static/docs/command-reference/import-url.md @@ -129,15 +129,13 @@ in the [Get Started](/doc/get-started) section.
-### Click and expand to setup the example project - -Follow these instructions before each example below if you actually want to try -them on your system. +### Click and expand to setup example Start by cloning our example repo if you don't already have it. Then move into the repo and checkout the -[version](https://github.com/iterative/example-get-started/releases/tag/2-remote) -corresponding to the [Configure](/doc/get-started/configure) chapter: +[2-remote](https://github.com/iterative/example-get-started/releases/tag/2-remote) +tag, corresponding to the [Configure](/doc/get-started/configure) _Get Started_ +chapter: ```dvc $ git clone https://github.com/iterative/example-get-started diff --git a/public/static/docs/command-reference/import.md b/public/static/docs/command-reference/import.md index 30dedc63ba..ac9cb0c17d 100644 --- a/public/static/docs/command-reference/import.md +++ b/public/static/docs/command-reference/import.md @@ -73,11 +73,10 @@ data artifact from the source repo. an existing directory is specified, then the output will be placed inside of it. -- `--rev` - specific - [Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) - (such as a branch name, a tag, or a commit hash) of the repository to download - the file or directory from. The tip of the default branch is used by default - when this option is not specified. +- `--rev` - commit SHA hash, branch or tag name, etc. (any + [Git revision](https://git-scm.com/docs/revisions)) of the repository to + download the file or directory from. The latest commit in `master` (tip of the + default branch) is used by default when this option is not specified. > Note that this adds a `rev` field in the import stage that fixes it to this > revision. This can impact the behavior of `dvc update`. (See @@ -128,11 +127,11 @@ outs: Several of the values above are pulled from the original stage file `model.pkl.dvc` in the external DVC repository. 
The `url` and `rev_lock` subfields under `repo` are used to save the origin and version of the -dependency. +dependency, respectively. ## Example: Fixed revisions & re-importing -To import a specific revision of a data artifact, we may use the +To import a specific version of a data artifact, we may use the `--rev` option: ```dvc @@ -160,9 +159,10 @@ deps: If `rev` is a Git branch or tag (where the commit it points to changes), the data source may have updates at a later time. To bring it up to date if so (and update `rev_lock` in the DVC-file), simply use `dvc update .dvc`. If -`rev` is a specific commit (does not change), `dvc update` will never have an -effect on the import stage. You may **re-import** a different commit instead, by -using `dvc import` again with a different (or without) `--rev`. For example: +`rev` is a specific commit SHA hash (does not change), `dvc update` will never +have an effect on the import stage. You may **re-import** a different commit +instead, by using `dvc import` again with a different (or without) `--rev`. For +example: ```dvc $ dvc import --rev master \ diff --git a/public/static/docs/command-reference/install.md b/public/static/docs/command-reference/install.md index 7fe16be3af..2e258e557e 100644 --- a/public/static/docs/command-reference/install.md +++ b/public/static/docs/command-reference/install.md @@ -21,26 +21,26 @@ etc.) doesn't have DVC initialized (no `.dvc/` directory present). Namely: -**Checkout**: For any given branch or tag, `git checkout` retrieves the -[DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. The -project's DVC-files in turn refer to data stored in cache, but not -necessarily in the workspace. Normally, it would be necessary to -run `dvc checkout` to synchronize workspace and DVC-files. +**Checkout**: For any commit SHA hash, branch or tag, `git checkout` retrieves +the [DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. 
+The project's DVC-files in turn refer to data stored in cache, but +not necessarily in the workspace. Normally, it would be necessary +to run `dvc checkout` to synchronize workspace and DVC-files. This hook automates running `dvc checkout`. -**Commit/Reproduce**: When committing a change to the Git repository, that -change possibly produces new data files not yet in cache, which requires running -`dvc commit` to store them. Or the change might require reproducing the -corresponding [pipeline](/doc/command-reference/pipeline) (with `dvc repro`) to -regenerate the project's results (which implicitly commits them to DVC as well). +**Commit/Reproduce**: When committing a change with Git, that change possibly +produces new data files not yet in cache, which requires running `dvc commit` to +store them. Or the change might require reproducing the corresponding +[pipeline](/doc/command-reference/pipeline) (with `dvc repro`) to regenerate the +project's results (which implicitly commits them to DVC as well). This hook automates reminding the user to run either `dvc commit` or `dvc repro`, as needed. -**Push**: While publishing changes to the Git remote repo with `git push`, it -easy to forget that the `dvc push` command is necessary to upload new or updated -data files and directories under DVC control to +**Push**: While publishing changes to the Git remote with `git push`, it's easy +to forget that the `dvc push` command is necessary to upload new or updated data +files and directories under DVC control to [remote storage](/doc/command-reference/remote). This hook automates `dvc push`. @@ -120,20 +120,18 @@ $ dvc pull --all-branches --all-tags 
-## Example: Checkout both DVC and Git +## Example: Checkout both Git and DVC -Let's start our exploration with the impact of `dvc install` on the -`dvc checkout` command. Remember that switching from one Git commit to another -(with `git checkout`) changes the set of -[DVC-files](/doc/user-guide/dvc-file-format) in the project. This changes the -set of data files that should be located in the workspace (which can be achieved -with `dvc checkout`). +Switching from one Git commit to another (with `git checkout`) may change the +set of [DVC-files](/doc/user-guide/dvc-file-format) in the +workspace. This would mean that the currently present data files +and directories no longer match the project's version (which can be fixed with +`dvc checkout`). Let's first list the available tags in the _Get Started_ repo: ```dvc $ git tag - 0-empty 1-initialize 2-remote @@ -145,8 +143,7 @@ $ git tag 8-evaluation 9-bigrams-model 10-bigrams-experiment -baseline-experiment -bigrams-experiment +... ``` These tags are used to mark points in the development of the project, and to @@ -160,7 +157,6 @@ Note: checking out '6-featurization'. You are in 'detached HEAD' state... $ dvc status - featurize.dvc: changed outs: modified: data/features @@ -168,7 +164,6 @@ featurize.dvc: $ dvc checkout $ dvc status - Data and pipelines are up to date. ``` @@ -231,19 +226,17 @@ Look carefully at this output and it is clear that the `dvc checkout` command has indeed been run. As a result the workspace is up to date with the data files matching what is referenced by the DVC-files. -## Example: Showing DVC status on Git commit +## Example: Showing DVC status when committing with Git -The other hook installed by `dvc install` runs before `git commit` operation. To -see see what that does, start with the same workspace, making sure it is not in -the _detached HEAD_ state from the previous example by first running -`git checkout master`. 
+To follow this example, start with the same workspace as before, making sure it +is not in a _detached HEAD_ state by running `git checkout master`. If we simply edit one of the code files: ```dvc $ vi src/featurization.py -$ git commit -a -m "modified featurization" +$ git commit -a -m "Modified featurization" featurize.dvc: changed deps: @@ -280,7 +273,7 @@ Data and pipelines are up to date. ``` After reproducing this pipeline up to the "evaluate" stage, the data files are -in sync with the code/config files, but we must now commit the changes to the -Git repo. Looking closely we see that `dvc status` is used again, informing us -that the data files are synchronized with the -`Data and pipelines are up to date.` message. +in sync with the code/config files, but we must now commit the changes with Git. +Looking closely we see that `dvc status` is used again, informing us that the +data files are synchronized with the `Data and pipelines are up to date.` +message. diff --git a/public/static/docs/command-reference/metrics/add.md b/public/static/docs/command-reference/metrics/add.md index d610af862c..8b571b5a3e 100644 --- a/public/static/docs/command-reference/metrics/add.md +++ b/public/static/docs/command-reference/metrics/add.md @@ -20,8 +20,8 @@ defines the given `path` as an output, marking `path` as a Note that outputs can also be marked as metrics via the `-m` or `-M` options of the `dvc run` command. -While any text file can be tracked as a metric file, we recommend using `TSV`, -`CSV`, or `JSON` formats. DVC provides a way to parse those formats to get to a +While any text file can be tracked as a metric file, we recommend using TSV, +CSV, or JSON formats. DVC provides a way to parse those formats to get to a specific value, if the file contains multiple metrics. See `dvc metrics show` for more details. @@ -30,22 +30,22 @@ for more details. ## Options -- `-t`, `--type` - specify a type of the metric file. 
Accepted values are: - `raw`, `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be saved into the +- `-t`, `--type` - specify a type of the metric file. Accepted values are: `raw` + (default), `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be saved into the corresponding DVC-file, and used by `dvc metrics show` to determine how to handle displaying metrics. - `raw` is the default when no type is provided. It means that no additional - parsing is applied, and `--xpath` is ignored. `htsv`/`hcsv` are the same as - `tsv`/`csv`, but the values in the first row of the file will be used as the - field names and should be used to address columns in the `--xpath` option. + `raw` means that no additional parsing is applied, and `--xpath` is ignored. + `htsv`/`hcsv` are the same as `tsv`/`csv`, but the values in the first row of + the file will be used as the field names and should be used to address columns + in the `--xpath` option. - `-x`, `--xpath` - specify a path within a metric file to get a specific metric value. Should be used if the metric file contains multiple numbers and you - need to get a only one of them. Only a single path is allowed. It will be - saved into the corresponding DVC-file, and used by `dvc metrics show` to - determine how to handle displaying metrics. The accepted value depends on the - metric file type (`--type` option): + want to use only one of them. Only a single path is allowed. It will be saved + into the corresponding DVC-file, and used by `dvc metrics show` to determine + how to display metrics. The accepted value depends on the metric file type + (`--type` option): - For `json` - see [JSONPath spec](https://goessner.net/articles/JsonPath/) or [jsonpath-ng](https://github.com/h2non/jsonpath-ng) for available options. 
diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md new file mode 100644 index 0000000000..23a2fc3052 --- /dev/null +++ b/public/static/docs/command-reference/metrics/diff.md @@ -0,0 +1,122 @@ +# metrics diff + +Show changes in [metrics](/doc/command-reference/metrics#description) between +commits in the DVC repository, or between a commit and the +workspace. + +## Synopsis + +```usage +usage: dvc metrics diff [-h] [-q | -v] + [--targets [TARGETS [TARGETS ...]]] + [-t TYPE] [-x XPATH] [-R] [--show-json] + [a_rev] [b_rev] + +positional arguments: + a_rev Old Git commit to compare (defaults to HEAD) + b_rev New Git commit to compare (defaults to the + current workspace) +``` + +## Description + +This command provides a quick way to compare results from your previous +experiments with the current results of your pipeline, as long as you're using +metrics that DVC is aware of (see `dvc metrics add`). Run without arguments, +this command compares all existing metric files currently present in the +workspace (uncommitted changes) with the latest committed version. + +The differences shown by this command include the new value and the numeric +difference (delta) from the previous value of metrics (with 3-digit accuracy). +They're calculated between two commits (hash, branch, tag, or any +[Git revision](https://git-scm.com/docs/revisions)) for all metrics in the +project, found by examining all of the +[DVC-files](/doc/user-guide/dvc-file-format) in both references. + +## Options + +- `--targets` - specific metric files or directories to calculate metrics + differences for. If omitted (default), this command uses all metric files + found in both Git references. + +- `-R`, `--recursive` - determines the metric files to use by searching each + target directory and its subdirectories for DVC-files to inspect. `targets` is + expected to contain one or more directories for this option to have effect. 
+ +- `-t`, `--type` - specify a type of the metric file. Accepted values are: `raw` + (default), `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be used to determine + how to parse and format metrics for display. See `dvc metrics show` for more + details. + + This option will override `type` and `xpath` defined in the corresponding + DVC-file. If no `type` is provided or found in the DVC-file, DVC will try to + detect it based on file extension. + +- `-x`, `--xpath` - specify a path within a metric file to show changes for a + specific metric value only. Should be used if the metric file contains + multiple numbers and you want to use only one of them. Only a single path is + allowed. It will override `xpath` defined in the corresponding DVC-file. See + `dvc metrics show` for more details. + +- `--show-json` - prints the command's output in easily parsable JSON format, + instead of a human-readable table. + +- `-h`, `--help` - prints the usage/help message, and exit. + +- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no + problems arise, otherwise 1. + +- `-v`, `--verbose` - displays detailed tracing information. + +## Examples + +Let's employ a simple workspace with some data, code, ML models, +pipeline stages, such as the DVC project created in our +[Get Started](/doc/get-started) section. Then we can see what happens with +`dvc metrics diff` in different situations. + +
+ +### Click and expand to setup the project + +Start by cloning our example repo if you don't already have it: + +```dvc +$ git clone https://github.com/iterative/example-get-started +$ cd example-get-started +``` + +
+ +Notice that we have an `auc.metric` metric file: + +``` +$ cat auc.metric +0.602818 +``` + +Now let's mock a change in our AUC metric: + +``` +$ echo '0.5' > auc.metric +``` + +To see the change, let's run `dvc metrics diff`. This compares our current +workspace (including uncommitted local changes) metrics to what we +had in the previous commit: + +``` +$ git diff +--- a/auc.metric ++++ b/auc.metric +@@ -1 +1 @@ +-0.602818 ++0.5 + +$ dvc metrics diff + Path Metric Value Change +auc.metric 0.500 -0.103 +``` + +> Note that metric files are typically versioned with Git, so we can use both +> `git diff` and `dvc metrics diff` to understand their changes, as seen above. diff --git a/public/static/docs/command-reference/metrics/index.md b/public/static/docs/command-reference/metrics/index.md index 538b474421..4abc570c91 100644 --- a/public/static/docs/command-reference/metrics/index.md +++ b/public/static/docs/command-reference/metrics/index.md @@ -3,6 +3,7 @@ A set of commands to collect and display project metrics: [add](/doc/command-reference/metrics/add), [show](/doc/command-reference/metrics/show), +[diff](/doc/command-reference/metrics/diff), [modify](/doc/command-reference/metrics/modify), and [remove](/doc/command-reference/metrics/remove). @@ -30,6 +31,7 @@ the best performing experiment. [Add](/doc/command-reference/metrics/add), [show](/doc/command-reference/metrics/show), +[diff](/doc/command-reference/metrics/diff), [modify](/doc/command-reference/metrics/modify), and [remove](/doc/command-reference/metrics/remove) commands are available to set up and manage DVC project metrics. 
diff --git a/public/static/docs/command-reference/metrics/modify.md b/public/static/docs/command-reference/metrics/modify.md index bfe6297881..dcd5fb657d 100644 --- a/public/static/docs/command-reference/metrics/modify.md +++ b/public/static/docs/command-reference/metrics/modify.md @@ -33,22 +33,22 @@ ERROR: failed to modify metric file settings - ## Options -- `-t`, `--type` - specify a type of the metric file. Accepted values are: - `raw`, `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be saved into the +- `-t`, `--type` - specify a type of the metric file. Accepted values are: `raw` + (default), `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be saved into the corresponding DVC-file, and used by `dvc metrics show` to determine how to handle displaying metrics. - `raw` is the default when no type is provided. It means that no additional - parsing is applied, and `--xpath` is ignored. `htsv`/`hcsv` are the same as - `tsv`/`csv`, but the values in the first row of the file will be used as the - field names and should be used to address columns in the `--xpath` option. + `raw` means that no additional parsing is applied, and `--xpath` is ignored. + `htsv`/`hcsv` are the same as `tsv`/`csv`, but the values in the first row of + the file will be used as the field names and should be used to address columns + in the `--xpath` option. - `-x`, `--xpath` - specify a path within a metric file to get a specific metric value. Should be used if the metric file contains multiple numbers and you - need to get a only one of them. Only a single path is allowed. It will be - saved into the corresponding DVC-file, and used by `dvc metrics show` to - determine how to handle displaying metrics. The accepted value depends on the - metric file type (`--type` option): + want to use only one of them. Only a single path is allowed. It will be saved + into the corresponding DVC-file, and used by `dvc metrics show` to determine + how to display metrics. 
The accepted value depends on the metric file type + (`--type` option): - For `json` - see [JSONPath spec](https://goessner.net/articles/JsonPath/) or [jsonpath-ng](https://github.com/h2non/jsonpath-ng) for available options. diff --git a/public/static/docs/command-reference/metrics/show.md b/public/static/docs/command-reference/metrics/show.md index b41606b45f..919828c3ac 100644 --- a/public/static/docs/command-reference/metrics/show.md +++ b/public/static/docs/command-reference/metrics/show.md @@ -22,47 +22,45 @@ show those specific metric files instead. With the `-a` or`-T` options, this command shows the different metrics values across all Git branches or tags, respectively. -The optional `targets` argument can contain several metric files. With the `-R` -option, a target can even be a directory, so that DVC recursively shows all -metric files in it. +The optional `targets` argument can contain one or more metric files. With the +`-R` option, some of the targets can even be directories, so that DVC recursively +shows all metric files inside. -Providing a `type` (`-t` option) overwrites the full metric specification (both +Providing a `type` (`-t` option) overrides the full metric specification (both `type` and `xpath` fields) defined in the -[DVC-file](/doc/user-guide/dvc-file-format) (usually set originally with the -`dvc metrics modify` command). +[DVC-file](/doc/user-guide/dvc-file-format) (with `dvc metrics modify`, +typically). If `type` (via `-t`) is not specified and only `xpath` (`-x` option) is, only -the `xpath` field is overwritten in its DVC-file. (DVC will first try to read +the `xpath` field from the DVC-file is overridden. (DVC will first try to read `type` from the DVC-file, but it can be automatically detected by the file extension.) -> Alternatively, see `dvc metrics modify` command to learn how to apply `-t` and -> `-x` permanently. +> See `dvc metrics modify` to learn how to apply `-t` and `-x` permanently. 
+ +An alternative way to display metrics is the `dvc metrics diff` command, which +compares them with a previous version. ## Options -- `-t`, `--type` - specify a type of the metric file. Accepted values are: - `raw`, `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be saved into the - corresponding DVC-file, and used by `dvc metrics show` to determine how to - handle displaying metrics. +- `-t`, `--type` - specify a type of the metric file. Accepted values are: `raw` + (default), `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be used to determine + how to parse and format metrics for display. - `raw` is the default when no type is provided. It means that no additional - parsing is applied, and `--xpath` is ignored. `htsv`/`hcsv` are the same as - `tsv`/`csv`, but the values in the first row of the file will be used as the - field names and should be used to address columns in the `--xpath` option. + `raw` means that no additional parsing is applied, and `--xpath` is ignored. + `htsv`/`hcsv` are the same as `tsv`/`csv`, but the values in the first row of + the file will be used as the field names and should be used to address columns + in the `--xpath` option. - This option along with `--xpath` below takes precedence over the `type` and - `xpath` specified in the corresponding DVC file. If this parameter is not - given, the type can be detected by the file extension automatically if the - type is supported. If any other value is specified, it is ignored and - defaulted back to `raw`. + This option will override `type` and `xpath` defined in the corresponding + DVC-file. If no `type` is provided or found in the DVC-file, DVC will try to + detect it based on file extension. - `-x`, `--xpath` - specify a path within a metric file to get a specific metric value. Should be used if the metric file contains multiple numbers and you - need to get a only one of them. Only a single path is allowed. 
It will be - saved into the corresponding DVC-file, and used by `dvc metrics show` to - determine how to handle displaying metrics. The accepted value depends on the - metric file type (`--type` option): + want to use only one of them. Only a single path is allowed. It will override + `xpath` defined in the corresponding DVC-file. The accepted value depends on + the metric file type (`--type` option): - For `json` - see [JSONPath spec](https://goessner.net/articles/JsonPath/) or [jsonpath-ng](https://github.com/h2non/jsonpath-ng) for available options. @@ -82,12 +80,13 @@ extension.) overwrite it for the current command run only – It may fail to produce any results or parse files that are not in a corresponding format in this case. -- `-a`, `--all-branches` - get and print metric file contents across all Git - branches. It can be used to compare different experiments. +- `-a`, `--all-branches` - print metric file contents in all Git branches + instead of just those present in the current workspace. It can be used to + compare different experiments. -- `-T`, `--all-tags` - get and print metric file contents across all Git tags. - Similar to `-a` above. Note that both options can be combined, for example - using the `-aT` flag. +- `-T`, `--all-tags` - print metric file contents in all Git tags. Similar to + `-a` above. Note that both options can be combined, for example using the + `-aT` flag. - `-R`, `--recursive` - determines the metric files to show by searching each target directory and its subdirectories for DVC-files to inspect. `targets` is diff --git a/public/static/docs/command-reference/pull.md b/public/static/docs/command-reference/pull.md index ea3b48c5f8..131bebaa4e 100644 --- a/public/static/docs/command-reference/pull.md +++ b/public/static/docs/command-reference/pull.md @@ -39,9 +39,9 @@ on how to configure a remote. 
With no arguments, just `dvc pull` or `dvc pull --remote REMOTE`, it downloads only the files (or directories) missing from the workspace by searching all [DVC-files](/doc/user-guide/dvc-file-format) currently in the -project. It will not download files associated with earlier -versions or branches of the repository if using Git, nor will it download files -that have not changed. +project. It will not download files associated with earlier commits +in the repository (if using Git), nor will it download files that +have not changed. The command `dvc status -c` can list files referenced in current DVC-files, but missing in the cache. It can be used to see what files `dvc pull` @@ -65,8 +65,9 @@ reflinks or hardlinks to put it in the workspace without copying. See (configured with the `core.config` config option) is used. - `-a`, `--all-branches` - determines the files to download by examining - DVC-files in all Git branches of the project repository (if using Git). It's - useful if branches are used to track experiments or project checkpoints. + DVC-files in all Git branches instead of just those present in the current + workspace. It's useful if branches are used to track experiments or project + checkpoints. - `-T`, `--all-tags` - the same as `-a`, `--all-branches` but Git tags are used to save different experiments or project checkpoints. Note that both options diff --git a/public/static/docs/command-reference/push.md b/public/static/docs/command-reference/push.md index c1e7271a8f..0b26154855 100644 --- a/public/static/docs/command-reference/push.md +++ b/public/static/docs/command-reference/push.md @@ -55,8 +55,8 @@ configure a remote. With no arguments, just `dvc push` or `dvc push --remote REMOTE`, it uploads only the files (or directories) that are new in the local repository to remote -storage. It will not upload files associated with earlier versions or branches -of the project directory, nor will it upload files that have not +storage. 
It will not upload files associated with earlier commits in the +repository (if using Git), nor will it upload files that have not changed. The `dvc status -c` command can list files tracked by DVC that are new in the @@ -77,8 +77,8 @@ to push. (configured with the `core.config` config option) is used. - `-a`, `--all-branches` - determines the files to upload by examining DVC-files - in all Git branches of the project repository (if using Git). It's useful if - branches are used to track experiments or project checkpoints. + in all Git branches instead of just those present in the current workspace. + It's useful if branches are used to track experiments or project checkpoints. - `-T`, `--all-tags` - the same as `-a`, `--all-branches`, but Git tags are used to save different experiments or project checkpoints. Note that both options diff --git a/public/static/docs/command-reference/remote/modify.md b/public/static/docs/command-reference/remote/modify.md index 1d20b67de2..6cfad99443 100644 --- a/public/static/docs/command-reference/remote/modify.md +++ b/public/static/docs/command-reference/remote/modify.md @@ -182,7 +182,7 @@ these settings, you could use the following options: > identifiable by `id` (AWS Canonical User ID), `emailAddress` or `uri` > (predefined group). 
- > **References**: + > **Sources** > > - [ACL Overview - Permissions](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#permissions) > - [Put Object ACL](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObjectAcl.html) diff --git a/public/static/docs/command-reference/repro.md b/public/static/docs/command-reference/repro.md index a5c500f0b0..a25078d862 100644 --- a/public/static/docs/command-reference/repro.md +++ b/public/static/docs/command-reference/repro.md @@ -21,9 +21,9 @@ positional arguments: `dvc repro` provides an way to regenerate data pipeline results, by restoring the dependency graph (a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) implicitly defined -by [stage files](/doc/command-reference/run) (DVC-files with dependencies) that -are found in the project. The commands defined in these stages can -then be executed in the correct order, reproducing pipeline results. +by the [stage files](/doc/command-reference/run) (DVC-files with dependencies) +that are found in the project. The commands defined in these stages +can then be executed in the correct order, reproducing pipeline results. > Pipeline stages are typically defined using the `dvc run` command, while > initial data dependencies can be registered by the `dvc add` command. @@ -38,15 +38,17 @@ command: by specifying stage file `targets`, or by using the `--single-item`, If specific [DVC-files](/doc/user-guide/dvc-file-format) (`targets`) are omitted, `Dvcfile` will be assumed. +`dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data +files, intermediate or final results. + By default, this command recursively searches in pipeline stages, starting from the `targets`, to determine which ones have changed. Then it executes the -corresponding commands.
Note that DVC removes cached outputs
+corresponding commands.
Note that DVC removes cached outputs before running the stages that produce them. -`dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data -files, intermediate or final results. It saves all the data files, intermediate -or final results into the DVC cache (unless `--no-commit` option is -specified), and updates stage files with the new checksum information. +It saves all the data files, intermediate or final results into the DVC +cache (unless the `--no-commit` option is used), and updates the hash +values of changed dependencies and outputs in the corresponding stage files. ### Parallel stage execution @@ -242,7 +244,8 @@ Saving information to 'Dvcfile'. ``` You can now check that `Dvcfile` and `count.txt` have been updated with the new -information, new `md5` checksums and a new result respectively. +information and updated dependency/output file checksums, and a new result, +respectively. ## Example: Downstream diff --git a/public/static/docs/command-reference/run.md b/public/static/docs/command-reference/run.md index 6bf323c083..15c6d936fb 100644 --- a/public/static/docs/command-reference/run.md +++ b/public/static/docs/command-reference/run.md @@ -132,7 +132,7 @@ data pipeline (e.g. random numbers, time functions, hardware dependency, etc.) - `--no-exec` - create a stage file, but do not execute the `command` defined in it, nor take dependencies or outputs under DVC control. In the DVC-file - contents, the `md5` hash sums will be empty; They will be populated the next + contents, the file hash values will be empty; They will be populated the next time this stage is actually executed. This is useful if, for example, you need to build a pipeline (dependency graph) first, and then run it all at once. 
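Illustration for the `repro.md` hunk above: it says `dvc repro` walks the pipeline stages starting from the `targets`, determines which ones have changed, and then executes the corresponding commands in the correct order. The following is only a minimal sketch of that idea in Python, not DVC's implementation. The stage names, the `graph` mapping and the `changed` set are hypothetical; real change detection compares the checksums recorded in the stage files.

```python
# Hypothetical pipeline: each stage file maps to the stage files it depends on.
graph = {
    "prepare.dvc": [],
    "featurize.dvc": ["prepare.dvc"],
    "train.dvc": ["featurize.dvc"],
    "evaluate.dvc": ["train.dvc"],
}


def execution_order(graph, target):
    """Return `target` and its upstream stages, dependencies first."""
    order, seen = [], set()

    def visit(stage):
        if stage in seen:
            return
        seen.add(stage)
        for dep in graph[stage]:
            visit(dep)
        order.append(stage)  # appended only after all of its dependencies

    visit(target)
    return order


def stages_to_run(graph, target, changed):
    """Pick the stages that must be re-executed, in a valid execution order."""
    dirty = []
    for stage in execution_order(graph, target):
        # A stage reruns if it changed itself or if any dependency is rerun.
        if stage in changed or any(dep in dirty for dep in graph[stage]):
            dirty.append(stage)
    return dirty


# Assumed scenario: only the featurize stage was detected as changed.
plan = stages_to_run(graph, "train.dvc", changed={"featurize.dvc"})
print(plan)
```

Stages downstream of the target (`evaluate.dvc` here) are left out of the plan, since the walk only goes upstream from the requested target.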
diff --git a/public/static/docs/command-reference/status.md b/public/static/docs/command-reference/status.md index 559d149d21..0b43fe9554 100644 --- a/public/static/docs/command-reference/status.md +++ b/public/static/docs/command-reference/status.md @@ -1,9 +1,9 @@ # status Show changes in the project -[pipelines](/doc/command-reference/pipeline), as well as mismatches either -between the cache and workspace files, or between the -cache and remote storage. +[pipelines](/doc/command-reference/pipeline), as well as file mismatches either +between the cache and workspace, or between the cache +and remote storage. ## Synopsis @@ -19,25 +19,26 @@ positional arguments: ## Description `dvc status` searches for changes in the existing pipelines, either showing -which [stages](/doc/command-reference/run) have changed in the workspace and -must be reproduced (with `dvc repro`), or differences between cache vs. remote -storage (meaning `dvc push` or `dvc pull` should be run to synchronize them). -The two modes, _local_ and _cloud_ are triggered by using the `--cloud` or -`--remote` options: +which [stages](/doc/command-reference/run) have changed in the workspace +(including uncommitted local changes) and must be reproduced (with `dvc repro`), +or differences between cache vs. remote storage (meaning `dvc push` +or `dvc pull` should be run to synchronize them). The two modes, _local_ and +_cloud_ are triggered by using the `--cloud` or `--remote` options: | Mode | CLI Option | Description | | ------ | ---------- | --------------------------------------------------------------------------------------------------------------------------- | | local | _none_ | Comparisons are made between data files in the workspace and corresponding files in the cache directory (e.g. `.dvc/cache`) | | remote | `--remote` | Comparisons are made between the cache, and the given remote. Remote storage is defined using the `dvc remote` command. 
| -| remote | `--cloud` | Comparisons are made between the cache, and the default remote, defined with `dvc remote --default` command. | +| remote | `--cloud` | Comparisons are made between the cache, and the default remote, typically defined with `dvc remote --default`. | -DVC determines data and code files to compare by analyzing all -[DVC-files](/doc/user-guide/dvc-file-format) in the project -(`--all-branches` and `--all-tags` in the `cloud` mode compare multiple -workspace versions). The comparison can be limited to specific DVC-files by -listing them as `targets`. Changes are reported only against the given -`targets`. When combined with the `--with-deps` option, a search is made for -changes in other stages that affect the target. +DVC determines which data and code files to compare by analyzing all +[DVC-files](/doc/user-guide/dvc-file-format) in the workspace (the +`--all-branches` and `--all-tags` options compare multiple workspace versions). + +The comparison can be limited to certain DVC-files only, by listing them as +`targets`. (Changes are reported only against these.) When this is combined with +the `--with-deps` option, a search is made for changes in other stages that +affect each target. In the `local` mode, changes are detected through the checksum of every file listed in every DVC-file in question against the corresponding file in the file @@ -53,12 +54,10 @@ This indicates that no differences were detected, and therefore no stages would be executed by `dvc repro`. If instead, differences are detected, `dvc status` lists those changes. For each -DVC-file (stage) with differences, the changes in _dependencies_ and/or -_outputs_ that differ are listed. For each item listed, either the file name or -the checksum is shown, and additionally a status word is shown describing the -changes (described below). 
This changes list provides a reference to both the -status of a DVC-file, as well as the changes to individual dependencies and -outputs described in it. +DVC-file (stage) with differences, the changes in dependencies +and/or outputs that differ are listed. For each item listed, either +the file name or the checksum is shown, and additionally a status word is shown +describing the changes (described below). - _changed checksum_ means that the DVC-file checksum has changed (e.g. someone manually edited the file). @@ -115,14 +114,13 @@ workspace) is different from remote storage. Bringing the two into sync requires name defined using the `dvc remote` command. Implies `--cloud`. - `-a`, `--all-branches` - compares cache content against all Git branches - instead of checking just the current workspace version. This basically runs - the same status command in all the branches of this repo. The corresponding - branches are shown in the status output. Applies only if `--cloud` or a `-r` - remote is specified. + instead of just the current workspace. This basically runs the same status + command in every branch of this repo. The corresponding branches are shown in + the status output. Applies only if `--cloud` or a `-r` remote is specified. - `-T`, `--all-tags` - compares cache content against all Git tags instead of - checking just the current workspace version. Similar to `-a` above. Note that - both options can be combined, for example using the `-aT` flag. + checking just the current workspace. Similar to `-a` above. Note that both + options can be combined, for example using the `-aT` flag. - `-j JOBS`, `--jobs JOBS` - specifies the number of jobs DVC can use to retrieve information from remote servers. 
This only applies when the `--cloud` diff --git a/public/static/docs/command-reference/update.md b/public/static/docs/command-reference/update.md index 4da4ca18c4..c4c81f1558 100644 --- a/public/static/docs/command-reference/update.md +++ b/public/static/docs/command-reference/update.md @@ -26,15 +26,10 @@ Note that import stages are considered always locked, meaning that if you run `dvc repro`, they won't be updated. `dvc update` is the only command that can update them. -Another detail to note is that when the `--rev` (revision) option of -`dvc import` has been used to create an import stage, DVC is not aware of what -kind of -[Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) this -is, for example a branch or a tag. For typically static references (e.g. tags), -or for SHA commits, `dvc update` will not have any effect on the import. Refer -to the -[re-importing example](/doc/command-reference/import#example-fixed-revisions-re-importing) -to learn how to "update" fixed-revision imports. +`dvc update` will not have an effect on import stages that are fixed to a commit +SHA hash (`rev` field in the DVC-file). Please refer to +[Fixed revisions & re-importing](/doc/command-reference/import#example-fixed-revisions-re-importing) +for more details. 
## Options diff --git a/public/static/docs/command-reference/version.md b/public/static/docs/command-reference/version.md index b1f3c8fabe..7d6f0c2221 100644 --- a/public/static/docs/command-reference/version.md +++ b/public/static/docs/command-reference/version.md @@ -16,8 +16,8 @@ system/environment: | Line | Detail | | ------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [`DVC version`](#components-of-dvc-version) | Version of DVC (along with a Git commit hash in case of a development version) | -| `Python version` | Version of the Python being used on the environment in which DVC is initialized | +| [`DVC version`](#components-of-dvc-version) | Version of DVC (along with a Git commit SHA hash in case of a development version) | +| `Python version` | Version of Python being used in the environment where DVC is initialized | | `Platform` | Information about the operating system of the machine | | [`Binary`](#what-we-mean-by-binary) | Shows whether DVC was installed from a package or from a binary release | | `Package manager` | Name of the package manager used to install DVC if any (`pip`, `conda`, etc) | @@ -53,10 +53,10 @@ The detail of DVC version depends upon the way of installing DVC. that might not be ready to publish yet. Therefore installing using the above command might have issues regarding its usage. So to trace any error reported with this setup, we need to know exactly which version is being used. For this - we rely on a git commit hash that is displayed in this command's output like - this: `0.40.2+292cab.mod`. The part before `+` is the `_BASE_VERSION`, and the - following part is the latest `master` branch commit hash. The optional suffix - `.mod` means that code is modified. + we rely on a Git commit SHA hash, that is displayed in this command's output + like this: `0.40.2+292cab.mod`. 
The part before `+` is the `_BASE_VERSION`, + and the following part is the SHA of the tip of the `master` branch. The + optional suffix `.mod` means that code is modified. ### What we mean by "Binary" diff --git a/public/static/docs/get-started/add-files.md b/public/static/docs/get-started/add-files.md index a4ebc9822e..78b404f622 100644 --- a/public/static/docs/get-started/add-files.md +++ b/public/static/docs/get-started/add-files.md @@ -52,9 +52,9 @@ $ ls -R .dvc/cache 04afb96060aad90176268345e10355 ``` -`a304afb96060aad90176268345e10355` from above is an MD5 hash of the `data.xml` -file we just added to DVC. And if you check the `data/data.xml.dvc` DVC-file you -will see that it has this hash inside. +`a304afb96060aad90176268345e10355` above is the hash value of the `data.xml` +file we just added to DVC. If you check the `data/data.xml.dvc` DVC-file, you +will see that it has this string inside. ### Important note on cache performance diff --git a/public/static/docs/get-started/agenda.md b/public/static/docs/get-started/agenda.md index 2ef97bb9ee..d1a6eeffee 100644 --- a/public/static/docs/get-started/agenda.md +++ b/public/static/docs/get-started/agenda.md @@ -12,7 +12,7 @@ $ git clone https://github.com/iterative/example-get-started Otherwise, bear with us and we'll introduce some basic DVC concepts to get the same results together! -The idea of the project is a simplified version of our +The idea for this project is a simplified version of our [Deep Dive Tutorial](/doc/tutorials/deep). It explores the NLP problem of predicting tags for a given StackOverflow question. 
For example, we might want a classifier that can classify (or predict) posts about Python by tagging them diff --git a/public/static/docs/get-started/experiments.md b/public/static/docs/get-started/experiments.md index a8b852610c..1f7fb337b8 100644 --- a/public/static/docs/get-started/experiments.md +++ b/public/static/docs/get-started/experiments.md @@ -35,7 +35,7 @@ $ git commit -am "Reproduce model using bigrams" > for more details. Now, we have a new `model.pkl` captured and saved. To get back to the initial -version we run `git checkout` along with `dvc checkout` command: +version, we run `git checkout` along with `dvc checkout` command: ``` $ git checkout baseline-experiment diff --git a/public/static/docs/get-started/import-data.md b/public/static/docs/get-started/import-data.md index daa9cf79e2..d911d3c0c9 100644 --- a/public/static/docs/get-started/import-data.md +++ b/public/static/docs/get-started/import-data.md @@ -68,7 +68,7 @@ outs: ``` The `url` and `rev_lock` subfields under `repo` are used to save the origin and -version of the dependency. +[version](https://git-scm.com/docs/revisions) of the dependency, respectively. > Note that `dvc update` updates the `rev_lock` field of the corresponding > DVC-file (when there are changes to bring in). diff --git a/public/static/docs/get-started/older-versions.md b/public/static/docs/get-started/older-versions.md index 5371cc59af..4fa976e442 100644 --- a/public/static/docs/get-started/older-versions.md +++ b/public/static/docs/get-started/older-versions.md @@ -17,7 +17,7 @@ $ dvc checkout train.dvc ``` These two commands will bring the previous model file to its place in the -working tree. +workspace.
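Illustration for the `version.md` hunk earlier in this patch: a development version such as `0.40.2+292cab.mod` consists of the `_BASE_VERSION` before the `+`, the SHA of the tip of `master` after it, and an optional `.mod` suffix for modified code. A minimal Python sketch of splitting such a string could look like this. The regular expression and the function name are assumptions made for the example, and the sketch only covers the `+` form shown in that hunk.

```python
import re

# Assumed shape: <base>[+<sha>[.mod]], e.g. `0.40.2+292cab.mod`.
VERSION_RE = re.compile(
    r"^(?P<base>\d+\.\d+\.\d+)(?:\+(?P<sha>[0-9a-f]+)(?P<mod>\.mod)?)?$"
)


def parse_dvc_version(text):
    """Split a version string into base version, commit SHA and modified flag."""
    match = VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return {
        "base": match.group("base"),
        "sha": match.group("sha"),  # None for an official release
        "modified": match.group("mod") is not None,
    }


dev = parse_dvc_version("0.40.2+292cab.mod")
release = parse_dvc_version("0.40.2")
print(dev)
print(release)
```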
diff --git a/public/static/docs/get-started/pipeline.md b/public/static/docs/get-started/pipeline.md
index b8d3da245c..d9f0f19390 100644
--- a/public/static/docs/get-started/pipeline.md
+++ b/public/static/docs/get-started/pipeline.md
@@ -36,8 +36,8 @@ $ dvc push
 ```

 This example is simplified just to show you a basic pipeline, see a more
-advanced [example](/doc/tutorials/pipelines) or complete
-[tutorial](/doc/tutorials/deep) to create a
+advanced [example](/doc/tutorials/pipelines) or
+[complete tutorial](/doc/tutorials/deep) to create an
 [NLP](https://en.wikipedia.org/wiki/Natural_language_processing) pipeline
 end-to-end.

diff --git a/public/static/docs/get-started/store-data.md b/public/static/docs/get-started/store-data.md
index 7cf79cb40c..1653900a15 100644
--- a/public/static/docs/get-started/store-data.md
+++ b/public/static/docs/get-started/store-data.md
@@ -35,8 +35,9 @@ $ ls -R /tmp/dvc-storage
 04afb96060aad90176268345e10355
 ```

-where `a304afb96060aad90176268345e10355` is an MD5 hash of the `data.xml` file,
-and if you check the `data.xml.dvc` [DVC-file](/doc/user-guide/dvc-file-format)
-you will see that it has this hash inside.
+`a304afb96060aad90176268345e10355` above is the hash value of the `data.xml`
+file. If you check the `data.xml.dvc`
+[DVC-file](/doc/user-guide/dvc-file-format), you will see that it has this
+string inside.
diff --git a/public/static/docs/install/pre-release.md b/public/static/docs/install/pre-release.md
index 8a1ea93e4c..887011b0c5 100644
--- a/public/static/docs/install/pre-release.md
+++ b/public/static/docs/install/pre-release.md
@@ -15,9 +15,9 @@ $ pip install git+https://github.com/iterative/dvc
 ```

 > `gitpython` allows the installation process to generate a DVC version using
-> the current Git commit SHA. This lets us to distinguish official DVC releases
-> (e.g. `0.64.3`) from a development version (e.g. `0.64.3-9c7381`). For more
-> information on our versioning convention, refer to
+> the current Git commit SHA hash. This lets us distinguish official DVC
+> releases (e.g. `0.64.3`) from a development version (e.g. `0.64.3-9c7381`).
+> For more information on our versioning convention, refer to
 > [Components of DVC version](/doc/command-reference/version#components-of-dvc-version).

 To install a development version for contributing to the project, please refer
diff --git a/public/static/docs/sidebar.json b/public/static/docs/sidebar.json
index 16c1a17583..ab11ce0e2b 100644
--- a/public/static/docs/sidebar.json
+++ b/public/static/docs/sidebar.json
@@ -258,6 +258,10 @@
     "label": "metrics modify",
     "slug": "modify"
   },
+  {
+    "label": "metrics diff",
+    "slug": "diff"
+  },
   {
     "label": "metrics remove",
     "slug": "remove"
diff --git a/public/static/docs/tutorials/deep/define-ml-pipeline.md b/public/static/docs/tutorials/deep/define-ml-pipeline.md
index f74bc374da..198d0fb5dc 100644
--- a/public/static/docs/tutorials/deep/define-ml-pipeline.md
+++ b/public/static/docs/tutorials/deep/define-ml-pipeline.md
@@ -69,7 +69,7 @@ need to run `dvc unprotect` or `dvc remove` first (see the
 If you take a look at the [DVC-file](/doc/user-guide/dvc-file-format) created by
 `dvc add`, you will see that outputs are tracked in the `outs` field. In
 this file, only one output is specified. The output contains the data
-file path in the repository and its MD5 checksum.
This checksum determines a +file path in the repository and its MD5 checksum. This checksum determines the location of the actual content file in the [cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory), `.dvc/cache`. diff --git a/public/static/docs/tutorials/deep/index.md b/public/static/docs/tutorials/deep/index.md index eb80b091f6..2f2e22d222 100644 --- a/public/static/docs/tutorials/deep/index.md +++ b/public/static/docs/tutorials/deep/index.md @@ -1,4 +1,4 @@ -# Tutorial +# Deep Dive Tutorial This tutorial shows you how to solve a text classification problem using the DVC tool. diff --git a/public/static/docs/tutorials/deep/reproducibility.md b/public/static/docs/tutorials/deep/reproducibility.md index 0b8979a7f1..9cb3997fa8 100644 --- a/public/static/docs/tutorials/deep/reproducibility.md +++ b/public/static/docs/tutorials/deep/reproducibility.md @@ -116,7 +116,7 @@ master: Let's keep the result in the repository. Later we can find out why bigrams don't add value to the current model and change that. -Many DVC-files were changed. This happened due to MD5 checksum changes. +Many DVC-files were changed. This happened due to file checksum changes. ```dvc $ git status -s @@ -232,7 +232,7 @@ CONFLICT (content): Merge conflict in Dvcfile Automatic merge failed; fix conflicts and then commit the result. ``` -The merge has a few conflicts. All of the conflicts are related to MD5 checksum +The merge has a few conflicts. All of the conflicts are related to file checksum mismatches in the branches. You can properly merge conflicts by prioritizing the checksums from the bigrams branch: that is, by removing all checksums of the other branch. 
diff --git a/public/static/docs/tutorials/pipelines.md b/public/static/docs/tutorials/pipelines.md index 388943b0a3..b1ffb0bcaf 100644 --- a/public/static/docs/tutorials/pipelines.md +++ b/public/static/docs/tutorials/pipelines.md @@ -5,7 +5,8 @@ Let's explore the natural language processing ([NLP](https://en.wikipedia.org/wiki/Natural_language_processing)) problem of predicting tags for a given StackOverflow question. For example, we want a classifier that can predict posts about the Python language by tagging them -`python`. (This is a short version of the [Tutorial](/doc/tutorials/deep).) +`python`. (This is a short version of the +[Deep Dive Tutorial](/doc/tutorials/deep).) In this example, we will focus on building a simple ML [pipeline](/doc/command-reference/pipeline) that takes an archive with @@ -182,7 +183,7 @@ outs: ``` Just like the DVC-file we created earlier with `dvc add`, this stage file uses -checksums that point to the cache to describe and version control dependencies +checksums that point to the cache, to describe and version control dependencies and outputs. Output `data/Posts.xml` file is saved as `.dvc/cache/a3/04afb96060aad90176268345e10355` and linked (or copied) to the workspace, as well as added to `.gitignore`. @@ -193,8 +194,8 @@ stages) we need to apply. This is important when you run `dvc repro` to regenerate the final or intermediate result. Second, hopefully it's clear by now that the actual data is stored in the -`.dvc/cache` directory, each file having a name based on an MD5 hash. This cache -is similar to Git's +`.dvc/cache` directory, each file having a name based on an `md5` hash. This +cache is similar to Git's [objects database](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects), but made specifically to handle large data files. 
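Illustration for the `pipelines.md` hunks above: they state that files in `.dvc/cache` are named after an `md5` hash, and that `data/Posts.xml` is saved as `.dvc/cache/a3/04afb96060aad90176268345e10355`. The Python sketch below shows how such a path can be derived from a hash. It assumes a plain MD5 over the raw file bytes, which may differ in detail from what DVC itself computes; the function names are hypothetical.

```python
import hashlib
import os
import tempfile


def cache_path_for_hash(digest, cache_dir=".dvc/cache"):
    """Split the hash into a 2-character directory and the remaining file name."""
    return f"{cache_dir}/{digest[:2]}/{digest[2:]}"


def file_md5(path, chunk_size=1024 * 1024):
    """MD5 of the raw file bytes, read in chunks so large files are not loaded at once."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


# The hash value quoted in the docs, mapped to its cache location.
example = cache_path_for_hash("a304afb96060aad90176268345e10355")
print(example)

# Round trip on a throwaway file (its content is arbitrary sample data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
digest = file_md5(tmp.name)
os.remove(tmp.name)
print(cache_path_for_hash(digest))
```

Using the first two characters as a directory name keeps any single directory from collecting every cached file, similar to the layout of Git's objects database mentioned in the hunk.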
diff --git a/public/static/docs/tutorials/versioning.md b/public/static/docs/tutorials/versioning.md index d1650a84ab..c93c6c6de3 100644 --- a/public/static/docs/tutorials/versioning.md +++ b/public/static/docs/tutorials/versioning.md @@ -15,8 +15,8 @@ to build a powerful image classifier using a pretty small dataset. We first train a classifier model using 1000 labeled images, then we double the number of images (2000) and retrain our model. We capture both datasets and -classifier results and show how to use `dvc checkout` along with `git checkout` -to switch between different versions. +classifier results and show how to use `dvc checkout` to switch between data +and/or model versions. The specific algorithm used to train and validate the classifier is not important, and no prior knowledge of Keras is required. We'll reuse the @@ -245,7 +245,7 @@ That's it! We have a second model and dataset saved and pointers to them committed with Git. Let's now look at how DVC can help us go back to the previous version if we need to. -## Switching between versions +## Switching between data and/or model versions The DVC command that helps get a specific committed version of data is designed to be similar to `git checkout`. All we need to do in our case is to @@ -263,14 +263,13 @@ $ git checkout v1.0 $ dvc checkout ``` -These commands will restore the working tree to the first snapshot we made: -code, data files, model, all of it. DVC optimizes this operation to avoid -copying data or model files each time. So `dvc checkout` is quick even if you -have large datasets, data files, or models. +These commands will restore the workspace to the first snapshot we made: code, +data files, model, all of it. DVC optimizes this operation to avoid copying data +or model files each time. So `dvc checkout` is quick even if you have large +datasets, data files, or models. 
-On the other hand, if we want to keep the current version of the code and go -back to the previous dataset only, we can do something like this (make sure that -you don't have uncommitted changes in `data.dvc`): +On the other hand, if we want to keep the current code, but go back to the +previous dataset version, we can do something like this: ```dvc $ git checkout v1.0 data.dvc @@ -278,8 +277,8 @@ $ dvc checkout data.dvc ``` If you run `git status` you'll see that `data.dvc` is modified and currently -points to the `v1.0` of the dataset, while code and model files are from the -`v2.0` version. +points to the `v1.0` version of the dataset, while code and model files are from +the `v2.0` tag.
@@ -288,8 +287,9 @@ points to the `v1.0` of the dataset, while code and model files are from the As we have learned already, DVC keeps data files out of Git (by adjusting `.gitignore`) and puts them into the cache (usually it's a `.dvc/cache` directory inside the repository). Instead, DVC creates -[DVC-files](/doc/user-guide/dvc-file-format). These text files serve as pointers -(MD5 hash) to the cache and are version controlled by Git. +[DVC-files](/doc/user-guide/dvc-file-format). These text files serve as data +placeholders that point to the cached files, and they can be easily version +controlled with Git. When we run `git checkout` we restore pointers (DVC-files) first, then when we run `dvc checkout` we use these pointers to put the right data in the right @@ -312,8 +312,8 @@ When you have a script that takes some data as an input and produces other data outputs, a better way to capture them is to use `dvc run`: > If you tried the commands in the -> [Switching between versions](#switching-between-versions) section, go back to -> the master branch code and data with: +> [Switching between data or model versions](#switching-between-data-or-model-versions) +> section, go back to the master branch code and data with: > > ```dvc > $ git checkout master @@ -374,7 +374,7 @@ hands-on experience with pipelines, and try to apply it here. Don't hesitate to join our [community](/chat) and ask any questions! Another detail we only brushed upon here is the way we captured the -`metrics.csv` metrics file with the `-M` option of `dvc run`. Marking this +`metrics.csv` metric file with the `-M` option of `dvc run`. Marking this output as a metric enables us to compare its values across Git tags or branches (for example, representing different experiments). 
See `dvc metrics` and [Compare Experiments](/doc/get-started/compare-experiments) to learn more diff --git a/public/static/docs/understanding-dvc/collaboration-issues.md b/public/static/docs/understanding-dvc/collaboration-issues.md index ace3d312f1..40c254351f 100644 --- a/public/static/docs/understanding-dvc/collaboration-issues.md +++ b/public/static/docs/understanding-dvc/collaboration-issues.md @@ -14,9 +14,9 @@ formalized. Common questions need to be answered in an unified, principled way. ### Source code and data versioning -- How do you avoid discrepancies between versions of the source code and - versions of the data files when the data cannot fit into a traditional - repository format? +- How do you avoid discrepancies between + [revisions](https://git-scm.com/docs/revisions) of source code and versions of + data files, when the data cannot fit into a traditional repository? ### Experiment time log diff --git a/public/static/docs/understanding-dvc/related-technologies.md b/public/static/docs/understanding-dvc/related-technologies.md index d6d367a392..185580d9dc 100644 --- a/public/static/docs/understanding-dvc/related-technologies.md +++ b/public/static/docs/understanding-dvc/related-technologies.md @@ -74,10 +74,10 @@ Luigi, etc. - File tracking: - - DVC tracks files based on checksum (MD5) instead of file timestamps. This - helps avoid running into heavy processes like model retraining when you - checkout a previous, trained version of a model's code (Make would retrain - the model). + - DVC tracks files based on their checksum (MD5) instead of file timestamps. + This helps avoid running into heavy processes like model retraining when you + checkout a previously trained version of a model (Make would retrain the + model). - DVC uses file timestamps and inodes for optimization. 
This allows DVC to avoid recomputing all dependency files' checksums, which would be highly diff --git a/public/static/docs/understanding-dvc/what-is-dvc.md b/public/static/docs/understanding-dvc/what-is-dvc.md index 4d8ab011b3..f2fd0533ed 100644 --- a/public/static/docs/understanding-dvc/what-is-dvc.md +++ b/public/static/docs/understanding-dvc/what-is-dvc.md @@ -18,15 +18,16 @@ branch or commit. DVC uses a few core concepts: -- **Experiment**: Equivalent to a Git repository version. Each experiment - (extract new features, change model hyperparameters, data cleaning, add a new - data source) should be performed in a separate branch and then merged into the - master branch only if the experiment is successful. DVC allows experiments to - be integrated into a Git repository history and NEVER needs to recompute the - results after a successful merge. +- **Experiment**: Equivalent to a + [Git-revision](https://git-scm.com/docs/revisions). Each experiment (extract + new features, change model hyperparameters, data cleaning, add a new data + source) should be performed in a separate branch or tag. DVC allows + experiments to be integrated into a Git repository history and never needs to + recompute the results after a successful merge. - **Experiment state** or state: Equivalent to a Git snapshot (all committed - files). Git checksum, branch name, or tag can be used as a reference to a + files). A Git commit SHA hash, branch or tag name, etc. can be used as a + [reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) to an experiment state. - **Reproducibility**: Action to reproduce an experiment state. 
This action diff --git a/public/static/docs/use-cases/data-registries.md b/public/static/docs/use-cases/data-registries.md index 7c848f22c5..68d3a4ddaf 100644 --- a/public/static/docs/use-cases/data-registries.md +++ b/public/static/docs/use-cases/data-registries.md @@ -109,7 +109,8 @@ This downloads `music/songs/` from the project's current working directory (anywhere in the file system with user write access). > Note that this command (as well as `dvc import`) has a `--rev` option to -> download specific versions of the data. +> download the data from a specific [commit](https://git-scm.com/docs/revisions) +> of the source repository. ### Import workflow @@ -137,13 +138,14 @@ $ dvc update dataset.dvc ``` `dvc update` downloads new and changed files, or removes deleted ones, from -`images/faces/`, based on the latest version of the source project. It also -updates the project dependency metadata in the import stage (DVC-file). +`images/faces/`, based on the latest commit in the source repo. It also updates +the project dependency metadata in the import stage (DVC-file). ### Programatic reusability of DVC data Our Python API, included with the `dvc` package installed with DVC, includes the -`open` function to load/stream data directly from external DVC projects: +`open` function to load/stream data directly from external DVC +projects: ```python import dvc.api.open diff --git a/public/static/docs/use-cases/versioning-data-and-model-files.md b/public/static/docs/use-cases/versioning-data-and-model-files.md index 567891a14d..5b94f215ee 100644 --- a/public/static/docs/use-cases/versioning-data-and-model-files.md +++ b/public/static/docs/use-cases/versioning-data-and-model-files.md @@ -83,17 +83,18 @@ There are two ways to get to the previous version of the dataset or model: a full workspace checkout, or checkout of a specific data or model file. Let's consider the full checkout first. 
It's quite straightforward: -> `v1.0` is a Git tag that should be created in advance to identify the dataset -> version you are interested in. Any Git reference (for example `HEAD^` or a -> commit hash) can be used instead. +> `v1.0` below is a Git tag that should be created in advance to identify the +> dataset version you are interested in. Any +> [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) +> (for example `HEAD^` or a commit SHA hash) can be used instead. ```dvc $ git checkout v1.0 $ dvc checkout ``` -These commands will restore the working tree to the first snapshot we made - -code, dataset and model files all matching each other. DVC can +These commands will restore the workspace to the first snapshot we made - code, +dataset and model files all matching each other. DVC can [optimize](/doc/user-guide/large-dataset-optimization) this operation to avoid copying files each time, so `dvc checkout` is quick even if you have large dataset or model files. @@ -108,8 +109,8 @@ $ dvc checkout data.dvc ``` If you run `git status` you will see that `data.dvc` is modified and currently -points to the version `v1.0` of the dataset. Meanwhile, code and model files are -their latest versions. +points to the `v1.0` version of the cached data. Meanwhile, code +and model files are their latest versions. ![](/static/img/versioning.png) diff --git a/public/static/docs/user-guide/analytics.md b/public/static/docs/user-guide/analytics.md index a15888f431..46946f0885 100644 --- a/public/static/docs/user-guide/analytics.md +++ b/public/static/docs/user-guide/analytics.md @@ -12,8 +12,8 @@ and features based on how, where and when people use DVC. For example: - If reflinks (depends on a file system type) are supported for most users, we can keep cache protected mode off by default (see `dvc unprotect`). -- Collecting the OS version and the way DVC was installed allows us to decide - what versions of OS to prioritize and support. 
+- Collecting OS information and the way DVC was installed allows us to decide
+  which OS platforms and versions to support and prioritize.
 - If usage of some command is negligible small it makes us think about issues
   with a command or documentation.
 
@@ -25,14 +25,14 @@ User and event data have a 14 month retention period.
 
 DVC's analytics record the following information per event:
 
-- The DVC version, e.g. `0.22.0`
-- The operating system information, e.g. `linux`, `ubuntu`, `14.04`, etc
-- The underlying version control system, e.g. `git`
-- Command type, e.g. `CmdDataPull`
-- Command return code, e.g. `1`
-- Way the DVC was installed, e.g. `binary`
-- A DVC analytics user ID (e.g. `8ca59a29-ddd9-4247-992a-9b4775732aad`),
-  generated by [`uuid`](https://docs.python.org/3/library/uuid.html)
+- The DVC version e.g. `0.82.0`
+- Whether DVC was installed from a binary release
+- Operating system information, e.g. Ubuntu 14.04
+- Whether the project uses Git
+- Command type e.g. `CmdDataPull`
+- Command return code e.g. `1`
+- A random user ID (e.g. `8ca59a29-ddd9-4247-992a-9b4775732aad`), generated with
+  [`uuid`](https://docs.python.org/3/library/uuid.html)
 
 This _does not allow us to track individual users_ but does enable us to
 accurately measure user counts vs. event counts.
diff --git a/public/static/docs/user-guide/contributing/core.md b/public/static/docs/user-guide/contributing/core.md
index 381a06f15d..fd28c66811 100644
--- a/public/static/docs/user-guide/contributing/core.md
+++ b/public/static/docs/user-guide/contributing/core.md
@@ -202,7 +202,7 @@ Install [Node.js](https://nodejs.org/en/download/) and then install and run
 Azurite:
 
 ```dvc
-$ npm install -g 'azurite@<3' # Need 2.x version
+$ npm install -g 'azurite@<3'
 $ mkdir azurite
 $ azurite -s -l azurite -d azurite/debug.log
 ```
diff --git a/public/static/docs/user-guide/dvc-file-format.md b/public/static/docs/user-guide/dvc-file-format.md
index 9c4542d5e1..3799677253 100644
--- a/public/static/docs/user-guide/dvc-file-format.md
+++ b/public/static/docs/user-guide/dvc-file-format.md
@@ -67,12 +67,11 @@ A dependency entry consists of a pair of fields:
 
   - `url`: URL of Git repository with source DVC project
   - `rev`: Only present when the `--rev` option of `dvc import` is used.
-    Specific
-    [Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References)
-    used to import the dependency from.
-  - `rev_lock`: Revision or version (Git commit hash) of the external DVC
-    repository at the time of importing or updating (with `dvc update`)
-    the dependency.
+    Specific commit SHA hash, branch or tag name, etc. (a
+    [Git revision](https://git-scm.com/docs/revisions)) used to import the
+    dependency from.
+  - `rev_lock`: Git commit SHA hash of the external DVC repository
+    at the time of importing or updating (with `dvc update`) the dependency.
 
 > See the examples in
 > [External Dependencies](/doc/user-guide/external-dependencies) for more
@@ -94,8 +93,8 @@ A metric entry consists of these fields:
 
 A `meta` entry consists of `key: value` pairs such as `name: John`. A meta entry
 can have any valid YAML structure containing any number of attributes.
-`"meta: string"` is also possible, it doesn't need to contain a hash (a.k.a.
-dictionary) structure always.
+`"meta: string"` is also possible, it doesn't need to contain a _hash_ structure
+(a.k.a. dictionary) always.
 
 Comments can be added to the DVC-file using `# comment` syntax. Comments and
 meta values are preserved between multiple executions of `dvc repro` and
diff --git a/public/static/docs/user-guide/external-dependencies.md b/public/static/docs/user-guide/external-dependencies.md
index 4a6257235a..1df8821ca2 100644
--- a/public/static/docs/user-guide/external-dependencies.md
+++ b/public/static/docs/user-guide/external-dependencies.md
@@ -185,6 +185,6 @@ outs:
 ```
 
 The `url` and `rev_lock` subfields under `repo` are used to save the origin and
-version of the dependency.
+[version](https://git-scm.com/docs/revisions) of the dependency, respectively.
diff --git a/public/static/docs/user-guide/managing-external-data.md b/public/static/docs/user-guide/managing-external-data.md
index 31244b6044..c508576b4f 100644
--- a/public/static/docs/user-guide/managing-external-data.md
+++ b/public/static/docs/user-guide/managing-external-data.md
@@ -28,14 +28,12 @@ supported:
 > Note that these are a subset of the remote storage types supported by
 > `dvc remote`.
 
-In order to specify an external output for a stage file use the usual `-o` and
-`-O` options with the `dvc run` command, but with the external path or URL
-pointing to your desired files. For cached external outputs (specified using
-`-o`) you will need to
-[setup an external cache](/doc/command-reference/config#cache) location that
-will be used by DVC to store versions of your external file. Non-cached external
-outputs (specified using `-O`) do not require an external cache to
-be setup.
+In order to specify an external output for a stage file, use the usual `-o` or
+`-O` options of the `dvc run` command, but with the external path or URL
+pointing to the file in question. For cached external outputs
+(`-o`) you will need to
+[setup an external cache](/doc/command-reference/config#cache) location.
+Non-cached external outputs (`-O`) do not require an external cache to be setup.
 > Avoid using the same remote location that you are using for `dvc push`,
 > `dvc pull`, `dvc fetch` as external cache for your external outputs, because
diff --git a/public/static/docs/user-guide/running-dvc-on-windows.md b/public/static/docs/user-guide/running-dvc-on-windows.md
index c88f7118f9..5fbd6245cd 100644
--- a/public/static/docs/user-guide/running-dvc-on-windows.md
+++ b/public/static/docs/user-guide/running-dvc-on-windows.md
@@ -24,8 +24,8 @@ Its also possible to enjoy a full Linux terminal experience with the
 
 ## Disable short-file name generation
 
-With NTFS, user may want to disable `8dot3` as per
-[this reference]()
+With NTFS, users may want to disable `8dot3` as per
+[this article](https://support.microsoft.com/en-us/help/121007/how-to-disable-8-3-file-name-creation-on-ntfs-partitions)
 to disable the short-file name generation. It is important to do so for better
 performance when the user has over 300K files in a single directory.
 
@@ -51,9 +51,8 @@ guide.
 ## Avoid directories with large number of files
 
 The performance of NTFS degrades while handling large volumes of files in a
-directory.
-[Here](https://stackoverflow.com/questions/197162/ntfs-performance-and-large-volumes-of-files-and-directories)
-is the resource for reference.
+directory, as explained in
+[this issue](https://stackoverflow.com/questions/197162/ntfs-performance-and-large-volumes-of-files-and-directories).
 
 ## Enabling paging with `less`
 
diff --git a/public/static/docs/user-guide/updating-tracked-files.md b/public/static/docs/user-guide/updating-tracked-files.md
index 0017cee2cb..95e771f5d8 100644
--- a/public/static/docs/user-guide/updating-tracked-files.md
+++ b/public/static/docs/user-guide/updating-tracked-files.md
@@ -67,7 +67,7 @@ Edit the content of the file:
 $ echo "new data item" >> train.tsv
 ```
 
-Add a new version of the file back to DVC:
+Add the new version of the file back to DVC:
 
 ```dvc
 $ dvc add train.tsv