Commit

term: "checksum" vs "hash" (MD5 context) & add "SHA" (Git commit context) (#962)

* add docs for dvc metrics diff

* nav: add `metrics diff` to sidebar

* cmd ref: typos in `metrics diff`

* cmd ref: rewrite `metrics diff` ref and
and review related concepts throughout docs e.g. "Git reference", "working tree"

* cmd ref: update descs, review options, link all metrics subcmds
addresses #933 (review)
as well as #933 (review)
and #933 (review)

* cmd ref: update cmd argument descriptions for `diff` and `metics diff`

* metrics diff: big terminology review around the intro of this new command
per #933 (review) et al.

* term: review usage of "hash", "commit hash", "SHA", and "MD5"
per #933 (review)

* term: rewrite definition of "workspace"
per #933 (review)

* cmd ref: change link from `metrics diff` options to `metrics show`
per #933 (comment)

* cmd ref: update example in `dvc metrics diff` and similar ones
per #933 (review)

* cmd ref: simplify dvc gc -a option
per #933 (review)
and #933 (review)

* cmd ref: use "reference" more than "revision" in diff
per #933 (review)

* cmd ref: link term "revision" in diff and `metrics diff`
also per #933 (review)

* term: put Git ref exapmles before term and link
per #933 (review)

* cmd ref: friendlier explanation of "tip of default branch"
per #933 (review)

* cmd ref: use tag name instead of term "the revision"
per #933 (review)

* term: revert some "revision"->"reference" changes, and related simplifications
per #933 (review)

* cmd ref: review desc. of `-a` options throughout refs

* cmd ref: update diff params
per iterative/dvc/pull/3244

* cmd ref: update notes around moving/static Git refs in import and update
per #933 (review)

* revert workspace glossary entry
per #933 (review)

* tutorial: use full name of Deep Dive Tutorial in title and links
per #933 (review)

* user-guide: undo change on "binary" literal for analytics example
per #933 (review)

* use-cases: avoid term "revision" in data-registries
per #933 (review)

* term: avoid "checksum" in favor of file "hash" value
per #933 (review)

* term: SHA hash -> hash (Git commit context)
per #962 (comment)

Co-authored-by: Ruslan Kuprieiev <[email protected]>
jorgeorpinel and efiop authored Feb 13, 2020
1 parent a94e7cc commit 01b5efa
Showing 30 changed files with 143 additions and 149 deletions.
6 changes: 3 additions & 3 deletions public/static/docs/changelog/0.35.md
@@ -59,9 +59,9 @@ improvements) we have done in the last few months:

- ⚡️ **Performance optimizations.** The most notable one is the migration from
using a plain JSON file to an (embedded) SQLLite instance, to cache file and
directory checksums. Another one is improved performance, stability and
general user experience for the commands that navigate tags or branches (all
the commands that include `--all-bracnhes`, `-a` or `--all-tags`, `-T`).
directory hashes. Another one is improved performance, stability and general
user experience for the commands that navigate tags or branches (all the
commands that include `--all-bracnhes`, `-a` or `--all-tags`, `-T`).

There are new [integrations and plugins](/doc/install/plugins) available:

14 changes: 7 additions & 7 deletions public/static/docs/command-reference/add.md
@@ -29,16 +29,16 @@ that becomes [external outputs](/doc/user-guide/managing-external-data).
Under the hood, a few actions are taken for each file (or directory) in
`targets`:

1. Calculate the file checksum.
1. Calculate the file hashes.
2. Move the file contents to the cache directory (by default in `.dvc/cache`),
using the checksum to form the cached file names. (See
using the file hash to form the cached file names. (See
[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
for more details.)
3. Attempt to replace the file by a link to the file in cache (more details
below).
4. Create a corresponding DVC-file and store the checksum to identify the cached
file. Unless the `-f` option is used, the DVC-file name generated by default
is `<file>.dvc`, where `<file>` is the file name of the first target.
4. Create a corresponding DVC-file and store the file hash to identify the
cached file. Unless the `-f` option is used, the DVC-file name generated by
default is `<file>.dvc`, where `<file>` is the file name of the first target.
5. Unless `dvc init --no-scm` was used when initializing the project, add the
`targets` to `.gitignore` in order to prevent them from being committed to
the Git repository.
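
Steps 1 and 2 in the list above can be illustrated with a minimal Python sketch. It assumes a plain MD5 over the raw file bytes and the cache layout described in the `commit.md` hunk of this same commit (the first two characters of the hash as a subdirectory, the rest as the file name). The function names `file_md5` and `cache_path` are hypothetical and are not DVC's API; DVC's own implementation may differ in details.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir: Path, file_hash: str) -> Path:
    """Map a hash to <cache_dir>/<first two chars>/<remaining chars>."""
    # Hypothetical helper: mirrors the layout described in the docs,
    # it does not call into DVC.
    return cache_dir / file_hash[:2] / file_hash[2:]
```

For instance, `cache_path(Path(".dvc/cache"), "662eb7f64216d9c2c1088d0a5e2c6951")` yields `.dvc/cache/66/2eb7f64216d9c2c1088d0a5e2c6951`, the same `66/2eb7...` split that appears in the remote URL quoted in the `get.md` hunk.
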
@@ -48,7 +48,7 @@ Under the hood, a few actions are taken for each file (or directory) in

The result is that the target data gets cached by DVC, and instead small
DVC-files can be tracked with Git. The DVC-file lists the added file as an
output (`outs` field), and references the cached file using the checksum. See
output (`outs` field), and references the cached file using its hash. See
[DVC-File Format](/doc/user-guide/dvc-file-format) for more details.

> Note that DVC-files created by this command are considered _orphans_ because
@@ -150,7 +150,7 @@ meta: # Special field to contain arbitary user data
email: [email protected]
```
This is a standard DVC-file with only an `outs` entry. The checksum should
This is a standard DVC-file with only an `outs` entry. The hash value should
correspond to an entry in the <abbr>cache</abbr>.

> Note that the `meta` values above were entered manually for this example. Meta
14 changes: 7 additions & 7 deletions public/static/docs/command-reference/checkout.md
@@ -33,8 +33,8 @@ The execution of `dvc checkout` does the following:

- Scans the DVC-files to compare against the data files or directories in the
<abbr>workspace</abbr>. DVC knows which data (<abbr>outputs</abbr>) match
because the corresponding file hash values are saved in the `outs` fields in
the DVC-files. Scanning is limited to the given `targets` (if any). See also
because the corresponding hash values are saved in the `outs` fields in the
DVC-files. Scanning is limited to the given `targets` (if any). See also
options `--with-deps` and `--recursive` below.

- Missing data files or directories, or those that don't match with any
@@ -147,7 +147,7 @@ bigrams-experiment <- Uses bigrams to improve the model
This project comes with a predefined HTTP
[remote storage](/doc/command-reference/remote). We can now just run `dvc pull`
that will fetch and checkout the most recent `model.pkl`, `data.xml`, and other
files that are under DVC control. The model file checksum
files that are under DVC control. The model file hash
`3863d0e317dee0a55c4e59d2ec0eef33` will be used in the `train.dvc`
[stage file](/doc/command-reference/run):

@@ -195,10 +195,10 @@ MD5 (model.pkl) = 43630cce66a2432dcecddc9dd006d0a7
```

What happened is that DVC went through the DVC-files and adjusted the current
set of files to match the `outs` in them. `dvc fetch` is run this once to
download missing data from the remote storage to the <abbr>cache</abbr>.
(Alternatively, we could have just run `dvc pull` to do `dvc fetch` +
`dvc checkout` in one step.)
set of <abbr>output</abbr> files to match the `outs` in them. `dvc fetch` is run
this once to download missing data from the remote storage to the
<abbr>cache</abbr>. (Alternatively, we could have just run `dvc pull` to do
`dvc fetch` + `dvc checkout` in one step.)

## Example: Automating DVC checkout

16 changes: 8 additions & 8 deletions public/static/docs/command-reference/commit.md
@@ -49,8 +49,8 @@ Let's take a look at what is happening in the fist scenario closely. Normally
DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the
<abbr>cache</abbr> after creating a DVC-file. What _commit_ means is that DVC:

- Computes a checksum for the file/directory.
- Enters the checksum and file name into the DVC-file.
- Computes a hash for the file/directory.
- Enters the hash value and file name into the DVC-file.
- Tells Git to ignore the file/directory (adding an entry to `.gitignore`).
(Note that if the <abbr>project</abbr> was initialized with no SCM support
(`dvc init --no-scm`), this does not happen.)
@@ -59,10 +59,10 @@ DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the
There are many cases where the last step is not desirable (for example rapid
iterations on an experiment). The `--no-commit` option prevents the last step
from occurring (on the commands where it's available), saving time and space by
not storing unwanted <abbr>data artifacts</abbr>. Checksums is still computed
and added to the DVC-file, but the actual data file is not saved in the cache.
This is where the `dvc commit` command comes into play. It performs that last
step (saving the data in cache).
not storing unwanted <abbr>data artifacts</abbr>. The file hash is still
computed and added to the DVC-file, but the actual data file is not saved in the
cache. This is where the `dvc commit` command comes into play. It performs that
last step (saving the data in cache).

Note that it's best to avoid the last two scenarios. They essentially
force-update the [DVC-files](/doc/user-guide/dvc-file-format) and save data to
@@ -81,7 +81,7 @@ reproducibility in those cases.
for this option to have effect. Determines the files to commit by searching
each target directory and its subdirectories for DVC-files to inspect.

- `-f`, `--force` - commit data even if checksums for dependencies or outputs
- `-f`, `--force` - commit data even if hash values for dependencies or outputs
did not change.

- `-h`, `--help` - prints the usage/help message, and exit.
@@ -196,7 +196,7 @@ wdir: .
To verify this instance of `model.pkl` is not in the cache, we must know the
path to the cached file. In the cache directory, the first two characters of the
checksum are used as a subdirectory name, and the remaining characters are the
hash value are used as a subdirectory name, and the remaining characters are the
file name. Therefore, had the file been committed to the cache, it would appear
in the directory `.dvc/cache/70`. Let's check:

14 changes: 7 additions & 7 deletions public/static/docs/command-reference/config.md
@@ -76,7 +76,7 @@ This is the main section with the general config options:
[anonymized usage statistics](/doc/user-guide/analytics). Accepts values
`true` (default) and `false`.

- `core.checksum_jobs` - number of threads for computing checksums. Accepts
- `core.checksum_jobs` - number of threads for computing file hashes. Accepts
positive integers. The default value is `max(1, min(4, cpu_count() // 2))`.

- `core.hardlink_lock` - use hardlink file locks instead of the default ones,
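
The default for `core.checksum_jobs` quoted in this hunk, `max(1, min(4, cpu_count() // 2))`, can be evaluated with a small Python sketch. The function name `default_checksum_jobs` and the fallback to one CPU when the count is unknown are assumptions made for illustration; they are not taken from DVC's source.

```python
import os
from typing import Optional


def default_checksum_jobs(cpus: Optional[int] = None) -> int:
    """Evaluate max(1, min(4, cpu_count() // 2)) for a given CPU count."""
    if cpus is None:
        # os.cpu_count() may return None when the count cannot be
        # determined; falling back to 1 is an assumption of this sketch.
        cpus = os.cpu_count() or 1
    # Half the CPUs, capped at 4 threads, and never fewer than 1.
    return max(1, min(4, cpus // 2))
```

Under this formula a machine with 6 CPUs would get 3 threads, while anything with 8 or more CPUs is capped at 4.
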
@@ -168,9 +168,9 @@ for more details.) This section contains the following options:

> Avoid using the same remote location that you are using for `dvc push`,
> `dvc pull`, `dvc fetch` as external cache for your external outputs, because
> it may cause possible checksum overlaps. Checksum for some data file on an
> external storage can potentially collide with checksum generated locally for
> a different file, with a different content.
> it may cause possible file hash overlaps: the hash of a data file in
> external storage could collide with a hash generated locally for another
> file with a different content.
- `cache.s3` - name of an
[Amazon S3 remote to use as external cache](/doc/user-guide/managing-external-data#amazon-s-3).
@@ -191,9 +191,9 @@ learn more about the state file (database) that is used for optimization.

- `state.row_limit` - maximum number of entries in the state database, which
affects the physical size of the state file itself, as well as the performance
of certain DVC operations. The bigger the limit the more checksum history DVC
can keep in order to avoid sequential checksum recalculations for the files.
Default limit is set to 10 000 000 rows.
of certain DVC operations. The bigger the limit, the longer the file hash
history that DVC can keep, in order to avoid sequential hash recalculations.
The default limit is set to 10,000,000 rows.

- `state.row_cleanup_quota` - percentage of the state database that is going to
be deleted when it hits the `state.row_limit`. When an entry in the database
16 changes: 8 additions & 8 deletions public/static/docs/command-reference/diff.md
@@ -17,10 +17,10 @@ positional arguments:

## Description

Given two commit SHA hashes, branch or tag names, etc.
Given two commit hashes, branch or tag names, etc.
([references](https://git-scm.com/docs/revisions)) `a_ref` and `b_ref`, this
command shows a comparative summary of basic statistics: how many files were
deleted/changed, and the file size differences.
command shows a comparative summary of basic statistics related to files tracked
by DVC: how many files were deleted/changed, and the file size differences.

> Note that `dvc diff` does not show the line-to-line comparisons like
> `git diff` or [GNU `diff`](https://www.gnu.org/software/diffutils/) can. This
@@ -78,12 +78,12 @@ Preparing to download data from 'https://remote.dvc.org/get-started'

## Example: Previous commit in the same branch

The minimal `dvc diff` command only includes the "from" reference (`a_ref`) from
which to calculate the difference. The "until" reference (`b_ref`) defaults to
`HEAD` (current Git commit).
The minimal `dvc diff`, run without arguments, defaults to comparing DVC-tacked
files between `HEAD` (current Git commit) and the current <abbr>workspace</abbr>
(uncommitted changes, if any).

To see the difference with the very previous commit of the project, we can use
`HEAD^` as `a_ref`:
To see the difference between the very previous commit of the project and the
workspace, we can use `HEAD^` as `a_ref`:

```dvc
$ dvc diff HEAD^
19 changes: 9 additions & 10 deletions public/static/docs/command-reference/fetch.md
@@ -64,11 +64,10 @@ for more information on how to configure different remote storage providers.
`dvc fetch`, `dvc pull`, and `dvc push` are related in that these 3 commands
perform data synchronization among local and remote storage. The specific way in
which the set of files to push/fetch/pull is determined begins with calculating
the checksums of the files in question, when these are
[added](/doc/get-started/add-files) to DVC. File checksums are then stored in
the corresponding DVC-files (usually saved in a Git branch). Only the checksums
specified in DVC-files currently in the project are considered by `dvc fetch`
(unless the `-a` or `-T` options are used).
file hashes when these are [added](/doc/get-started/add-files) to DVC. File
hashes are stored in the corresponding DVC-files (typically versioned with Git).
Only the hashes specified in DVC-files currently in the workspace are considered
by `dvc fetch` (unless the `-a` or `-T` options are used).

## Options

@@ -103,7 +102,7 @@ specified in DVC-files currently in the project are considered by `dvc fetch`
- `-T`, `--all-tags` - fetch cache for all Git tags. Similar to `-a` above. Note
that both options can be combined, for example using the `-aT` flag.

- `--show-checksums` - show checksums instead of file names when printing the
- `--show-checksums` - show file hashes instead of file names when printing the
download progress.

* `-h`, `--help` - prints the usage/help message, and exit.
@@ -194,8 +193,8 @@ Note that the `.dvc/cache` directory was created and populated.
> for more info.
As seen above, used without arguments, `dvc fetch` downloads all assets needed
by all DVC-files in the current branch, including for directories. The checksums
`3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4`
by all DVC-files in the current branch, including for directories. The hash
values `3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4`
correspond to the `model.pkl` file and `data/features/` directory, respectively.

Let's now link files from the cache to the workspace with:
@@ -232,8 +231,8 @@ $ tree .dvc/cache
> Note that `prepare.dvc` is the first stage in our example's pipeline.
Cache entries for the necessary directories, as well as the actual
`data/prepared/test.tsv` and `data/prepared/train.tsv` files were download,
checksums shown above.
`data/prepared/test.tsv` and `data/prepared/train.tsv` files were downloaded.
Their hash values are shown above.

## Example: With dependencies

2 changes: 1 addition & 1 deletion public/static/docs/command-reference/gc.md
@@ -79,7 +79,7 @@ $ du -sh .dvc/cache/
```

When you run `dvc gc` it removes all objects from cache that are not referenced
in the <abbr>workspace</abbr> (by collecting hash sums from the DVC-files):
in the <abbr>workspace</abbr> (by collecting hash values from the DVC-files):

```dvc
$ dvc gc
4 changes: 2 additions & 2 deletions public/static/docs/command-reference/get.md
@@ -56,7 +56,7 @@ name.
an existing directory is specified, then the output will be placed inside of
it.

- `--rev` - commit SHA hash, branch or tag name, etc. (any
- `--rev` - commit hash, branch or tag name, etc. (any
[Git revision](https://git-scm.com/docs/revisions)) of the repository to
download the file or directory from. The latest commit in `master` (tip of the
default branch) is used by default when this option is not specified.
@@ -134,7 +134,7 @@ https://remote.dvc.org/get-started/66/2eb7f64216d9c2c1088d0a5e2c6951

`remote.dvc.org/get-started` is an HTTP
[DVC remote](/doc/command-reference/remote), whereas
`662eb7f64216d9c2c1088d0a5e2c6951` is the file's checksum.
`662eb7f64216d9c2c1088d0a5e2c6951` is the file hash.

## Example: Compare different versions of data or model

6 changes: 3 additions & 3 deletions public/static/docs/command-reference/import-url.md
@@ -241,7 +241,7 @@ outs:
The DVC-file is nearly the same as in the previous example. The difference is
that the dependency (`deps`) now references the local file in the data store
directory we created previously. (Its `path` has the URL for the data store.)
And instead of an `etag` we have an `md5` checksum. We did this so its easy to
And instead of an `etag` we have an `md5` hash value. We did this so its easy to
edit the data file.

Let's now manually reproduce a
@@ -306,8 +306,8 @@ Data and pipelines are up to date.

In the data store directory, edit `data.xml`. It doesn't matter what you change,
as long as it remains a valid XML file, because any change will result in a
different dependency file checksum (`md5`) in the import stage DVC-file. Once we
do so, we can run `dvc update` to make sure the import stage is up to date:
different dependency file hash (`md5`) in the import stage DVC-file. Once we do
so, we can run `dvc update` to make sure the import stage is up to date:

```dvc
$ dvc update data.xml.dvc
9 changes: 4 additions & 5 deletions public/static/docs/command-reference/import.md
@@ -73,7 +73,7 @@ data artifact from the source repo.
an existing directory is specified, then the output will be placed inside of
it.

- `--rev` - commit SHA hash, branch or tag name, etc. (any
- `--rev` - commit hash, branch or tag name, etc. (any
[Git revision](https://git-scm.com/docs/revisions)) of the repository to
download the file or directory from. The latest commit in `master` (tip of the
default branch) is used by default when this option is not specified.
@@ -159,10 +159,9 @@ deps:
If `rev` is a Git branch or tag (where the commit it points to changes), the
data source may have updates at a later time. To bring it up to date if so (and
update `rev_lock` in the DVC-file), simply use `dvc update <stage>.dvc`. If
`rev` is a specific commit SHA hash (does not change), `dvc update` will never
have an effect on the import stage. You may **re-import** a different commit
instead, by using `dvc import` again with a different (or without) `--rev`. For
example:
`rev` is a specific commit hash (does not change), `dvc update` will never have
an effect on the import stage. You may **re-import** a different commit instead,
by using `dvc import` again with a different (or without) `--rev`. For example:

```dvc
$ dvc import --rev master \
19 changes: 10 additions & 9 deletions public/static/docs/command-reference/install.md
@@ -21,11 +21,11 @@ etc.) doesn't have DVC initialized (no `.dvc/` directory present).

Namely:

**Checkout**: For any commit SHA hash, branch or tag, `git checkout` retrieves
the [DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version.
The project's DVC-files in turn refer to data stored in <abbr>cache</abbr>, but
not necessarily in the <abbr>workspace</abbr>. Normally, it would be necessary
to run `dvc checkout` to synchronize workspace and DVC-files.
**Checkout**: For any commit hash, branch or tag, `git checkout` retrieves the
[DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. The
project's DVC-files in turn refer to data stored in <abbr>cache</abbr>, but not
necessarily in the <abbr>workspace</abbr>. Normally, it would be necessary to
run `dvc checkout` to synchronize workspace and DVC-files.

This hook automates running `dvc checkout`.

@@ -174,10 +174,11 @@ running `git checkout master`.
We also see that the first `dvc status` tells us about differences between the
project's <abbr>cache</abbr> and the data files currently in the workspace. Git
changed the DVC-files in the workspace, which changed references to data files.
What `dvc status` did is inform us the data files in the workspace no longer
matched the checksums in the [DVC-files](/doc/user-guide/dvc-file-format).
Running `dvc checkout` then checks out the corresponding data files, and a
second `dvc status` now tells us the data files match the DVC-files.
`dvc status` first informed us that the data files in the workspace no longer
matched the hash values in the corresponding
[DVC-files](/doc/user-guide/dvc-file-format). Running `dvc checkout` then brings
them up to date, and a second `dvc status` tells us that the data files now do
match the DVC-files.

```dvc
$ git checkout master
5 changes: 2 additions & 3 deletions public/static/docs/command-reference/remote/add.md
@@ -213,9 +213,8 @@ $ dvc remote modify myremote gdrive_client_secret <client secret>

Note that GDrive remotes are not "trusted" by default. This means that the
[`verify`](/doc/command-reference/remote/modify#available-settings-for-all-remotes)
option is enabled on this type of storage, so DVC recalculates the checksums of
files upon download (e.g. `dvc pull`), to make sure that these haven't been
modified.
option is enabled on this type of storage, so DVC recalculates the file hashes
upon download (e.g. `dvc pull`), to make sure that these haven't been modified.

> Please note our [Privacy Policy (Google APIs)](/doc/user-guide/privacy).