term: use "hash" in MD5 "checksum" context & Git commit contexts #962

Merged
merged 30 commits into from
Feb 13, 2020
30 commits
5f0da08
add docs for dvc metrics diff
efiop Jan 17, 2020
9da0661
nav: add `metrics diff` to sidebar
jorgeorpinel Jan 24, 2020
4a1b775
cmd ref: typos in `metrics diff`
jorgeorpinel Jan 27, 2020
8ae2e6d
cmd ref: rewrite `metrics diff` ref and
jorgeorpinel Jan 27, 2020
041fd22
Merge branch 'master' into 921
jorgeorpinel Jan 27, 2020
c056834
cmd ref: update descs, review options, link all metrics subcmds
jorgeorpinel Jan 27, 2020
344839a
cmd ref: update cmd argument descriptions for `diff` and `metics diff`
jorgeorpinel Jan 27, 2020
5a08d37
metrics diff: big terminology review around the intro of this new com…
jorgeorpinel Jan 29, 2020
f10cf2b
term: review usage of "hash", "commit hash", "SHA", and "MD5"
jorgeorpinel Jan 29, 2020
1d14086
term: rewrite definition of "workspace"
jorgeorpinel Jan 29, 2020
e55c362
cmd ref: change link from `metrics diff` options to `metrics show`
jorgeorpinel Jan 30, 2020
f989586
cmd ref: update example in `dvc metrics diff` and similar ones
jorgeorpinel Jan 30, 2020
734994a
cmd ref: simplify dvc gc -a option
jorgeorpinel Jan 30, 2020
961b513
cmd ref: use "reference" more than "revision" in diff
jorgeorpinel Jan 30, 2020
6b259ba
cmd ref: link term "revision" in diff and `metrics diff`
jorgeorpinel Jan 30, 2020
c006d18
term: put Git ref exapmles before term and link
jorgeorpinel Jan 30, 2020
e76329a
cmd ref: friendlier explanation of "tip of default branch"
jorgeorpinel Jan 30, 2020
d02ccd2
cmd ref: use tag name instead of term "the revision"
jorgeorpinel Jan 30, 2020
14d4c23
term: revert some "revision"->"reference" changes, and related simpli…
jorgeorpinel Jan 30, 2020
e7e0b97
cmd ref: review desc. of `-a` options throughout refs
jorgeorpinel Jan 30, 2020
c5dbb96
cmd ref: update diff params
jorgeorpinel Jan 30, 2020
b30df29
cmd ref: update notes around moving/static Git refs in import and update
jorgeorpinel Jan 30, 2020
1e9f3ae
revert workspace glossary entry
jorgeorpinel Jan 30, 2020
af6fc63
tutorial: use full name of Deep Dive Tutorial in title and links
jorgeorpinel Jan 30, 2020
bd0c9bd
user-guide: undo change on "binary" literal for analytics example
jorgeorpinel Jan 30, 2020
a0c51ff
use-cases: avoid term "revision" in data-registries
jorgeorpinel Jan 30, 2020
67e025e
term: avoid "checksum" in favor of file "hash" value
jorgeorpinel Jan 30, 2020
2793b97
Merge branch 'master' into term/checksum
jorgeorpinel Feb 12, 2020
355391a
Merge branch 'master' into term/checksum and
jorgeorpinel Feb 13, 2020
67bc455
term: SHA hash -> hash (Git commit context)
jorgeorpinel Feb 13, 2020
6 changes: 3 additions & 3 deletions public/static/docs/changelog/0.35.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,9 +59,9 @@ improvements) we have done in the last few months:

- ⚡️ **Performance optimizations.** The most notable one is the migration from
using a plain JSON file to an (embedded) SQLite instance, to cache file and
directory checksums. Another one is improved performance, stability and
general user experience for the commands that navigate tags or branches (all
the commands that include `--all-branches`, `-a` or `--all-tags`, `-T`).
directory hashes. Another one is improved performance, stability and general
user experience for the commands that navigate tags or branches (all the
commands that include `--all-branches`, `-a` or `--all-tags`, `-T`).

There are new [integrations and plugins](/doc/install/plugins) available:

Expand Down
14 changes: 7 additions & 7 deletions public/static/docs/command-reference/add.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,16 +29,16 @@ that becomes [external outputs](/doc/user-guide/managing-external-data).
Under the hood, a few actions are taken for each file (or directory) in
`targets`:

1. Calculate the file checksum.
1. Calculate the file hashes.
2. Move the file contents to the cache directory (by default in `.dvc/cache`),
using the checksum to form the cached file names. (See
using the file hash to form the cached file names. (See
[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
for more details.)
3. Attempt to replace the file by a link to the file in cache (more details
below).
4. Create a corresponding DVC-file and store the checksum to identify the cached
file. Unless the `-f` option is used, the DVC-file name generated by default
is `<file>.dvc`, where `<file>` is the file name of the first target.
4. Create a corresponding DVC-file and store the file hash to identify the
cached file. Unless the `-f` option is used, the DVC-file name generated by
default is `<file>.dvc`, where `<file>` is the file name of the first target.
5. Unless `dvc init --no-scm` was used when initializing the project, add the
`targets` to `.gitignore` in order to prevent them from being committed to
the Git repository.
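
The first steps in the list above (hash the file, then use the hash to name the cached copy) can be sketched in a few lines of Python. This is only an illustration, not DVC's implementation: the helper names `file_md5` and `cache_path` are hypothetical, and DVC's real hashing may differ in details. The `<first two characters>/<remaining characters>` layout follows the cache structure described in these docs.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large data files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(file_hash, cache_dir=".dvc/cache"):
    """Map a hash to <cache_dir>/<first 2 chars>/<remaining chars>."""
    return Path(cache_dir) / file_hash[:2] / file_hash[2:]
```

For example, under these assumptions `cache_path("3863d0e317dee0a55c4e59d2ec0eef33")` points at `.dvc/cache/38/63d0e317dee0a55c4e59d2ec0eef33`.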
Expand All @@ -48,7 +48,7 @@ Under the hood, a few actions are taken for each file (or directory) in

The result is that the target data gets cached by DVC, and instead small
DVC-files can be tracked with Git. The DVC-file lists the added file as an
output (`outs` field), and references the cached file using the checksum. See
output (`outs` field), and references the cached file using its hash. See
[DVC-File Format](/doc/user-guide/dvc-file-format) for more details.

> Note that DVC-files created by this command are considered _orphans_ because
Expand Down Expand Up @@ -150,7 +150,7 @@ meta: # Special field to contain arbitrary user data
email: [email protected]
```

This is a standard DVC-file with only an `outs` entry. The checksum should
This is a standard DVC-file with only an `outs` entry. The hash value should
correspond to an entry in the <abbr>cache</abbr>.

> Note that the `meta` values above were entered manually for this example. Meta
Expand Down
14 changes: 7 additions & 7 deletions public/static/docs/command-reference/checkout.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,8 +33,8 @@ The execution of `dvc checkout` does the following:

- Scans the DVC-files to compare against the data files or directories in the
<abbr>workspace</abbr>. DVC knows which data (<abbr>outputs</abbr>) match
because the corresponding file hash values are saved in the `outs` fields in
the DVC-files. Scanning is limited to the given `targets` (if any). See also
because the corresponding hash values are saved in the `outs` fields in the
DVC-files. Scanning is limited to the given `targets` (if any). See also
options `--with-deps` and `--recursive` below.

- Missing data files or directories, or those that don't match with any
Expand Down Expand Up @@ -147,7 +147,7 @@ bigrams-experiment <- Uses bigrams to improve the model
This project comes with a predefined HTTP
[remote storage](/doc/command-reference/remote). We can now just run `dvc pull`
that will fetch and checkout the most recent `model.pkl`, `data.xml`, and other
files that are under DVC control. The model file checksum
files that are under DVC control. The model file hash
`3863d0e317dee0a55c4e59d2ec0eef33` will be used in the `train.dvc`
[stage file](/doc/command-reference/run):

Expand Down Expand Up @@ -195,10 +195,10 @@ MD5 (model.pkl) = 43630cce66a2432dcecddc9dd006d0a7
```

What happened is that DVC went through the DVC-files and adjusted the current
set of files to match the `outs` in them. `dvc fetch` is run this once to
download missing data from the remote storage to the <abbr>cache</abbr>.
(Alternatively, we could have just run `dvc pull` to do `dvc fetch` +
`dvc checkout` in one step.)
set of <abbr>output</abbr> files to match the `outs` in them. `dvc fetch` is run
this once to download missing data from the remote storage to the
<abbr>cache</abbr>. (Alternatively, we could have just run `dvc pull` to do
`dvc fetch` + `dvc checkout` in one step.)

## Example: Automating DVC checkout

Expand Down
16 changes: 8 additions & 8 deletions public/static/docs/command-reference/commit.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,8 +49,8 @@ Let's take a look at what is happening in the first scenario closely. Normally
DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the
<abbr>cache</abbr> after creating a DVC-file. What _commit_ means is that DVC:

- Computes a checksum for the file/directory.
- Enters the checksum and file name into the DVC-file.
- Computes a hash for the file/directory.
- Enters the hash value and file name into the DVC-file.
- Tells Git to ignore the file/directory (adding an entry to `.gitignore`).
(Note that if the <abbr>project</abbr> was initialized with no SCM support
(`dvc init --no-scm`), this does not happen.)
Expand All @@ -59,10 +59,10 @@ DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the
There are many cases where the last step is not desirable (for example rapid
iterations on an experiment). The `--no-commit` option prevents the last step
from occurring (on the commands where it's available), saving time and space by
not storing unwanted <abbr>data artifacts</abbr>. Checksums is still computed
and added to the DVC-file, but the actual data file is not saved in the cache.
This is where the `dvc commit` command comes into play. It performs that last
step (saving the data in cache).
not storing unwanted <abbr>data artifacts</abbr>. The file hash is still
computed and added to the DVC-file, but the actual data file is not saved in the
cache. This is where the `dvc commit` command comes into play. It performs that
last step (saving the data in cache).

Note that it's best to avoid the last two scenarios. They essentially
force-update the [DVC-files](/doc/user-guide/dvc-file-format) and save data to
Expand All @@ -81,7 +81,7 @@ reproducibility in those cases.
for this option to have effect. Determines the files to commit by searching
each target directory and its subdirectories for DVC-files to inspect.

- `-f`, `--force` - commit data even if checksums for dependencies or outputs
- `-f`, `--force` - commit data even if hash values for dependencies or outputs
did not change.

- `-h`, `--help` - prints the usage/help message and exits.
Expand Down Expand Up @@ -196,7 +196,7 @@ wdir: .

To verify this instance of `model.pkl` is not in the cache, we must know the
path to the cached file. In the cache directory, the first two characters of the
checksum are used as a subdirectory name, and the remaining characters are the
hash value are used as a subdirectory name, and the remaining characters are the
file name. Therefore, had the file been committed to the cache, it would appear
in the directory `.dvc/cache/70`. Let's check:

Expand Down
14 changes: 7 additions & 7 deletions public/static/docs/command-reference/config.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,7 +76,7 @@ This is the main section with the general config options:
[anonymized usage statistics](/doc/user-guide/analytics). Accepts values
`true` (default) and `false`.

- `core.checksum_jobs` - number of threads for computing checksums. Accepts
- `core.checksum_jobs` - number of threads for computing file hashes. Accepts
positive integers. The default value is `max(1, min(4, cpu_count() // 2))`.

- `core.hardlink_lock` - use hardlink file locks instead of the default ones,
Expand Down Expand Up @@ -168,9 +168,9 @@ for more details.) This section contains the following options:

> Avoid using the same remote location that you are using for `dvc push`,
> `dvc pull`, `dvc fetch` as external cache for your external outputs, because
> it may cause possible checksum overlaps. Checksum for some data file on an
> external storage can potentially collide with checksum generated locally for
> a different file, with a different content.
> it may cause file hash overlaps: the hash of a data file in external
> storage could collide with a hash generated locally for another file with
> different content.

- `cache.s3` - name of an
[Amazon S3 remote to use as external cache](/doc/user-guide/managing-external-data#amazon-s-3).
Expand All @@ -191,9 +191,9 @@ learn more about the state file (database) that is used for optimization.

- `state.row_limit` - maximum number of entries in the state database, which
affects the physical size of the state file itself, as well as the performance
of certain DVC operations. The bigger the limit the more checksum history DVC
can keep in order to avoid sequential checksum recalculations for the files.
Default limit is set to 10 000 000 rows.
of certain DVC operations. The bigger the limit, the longer the file hash
history that DVC can keep, in order to avoid sequential hash recalculations.
The default limit is set to 10,000,000 rows.

- `state.row_cleanup_quota` - percentage of the state database that is going to
be deleted when it hits the `state.row_limit`. When an entry in the database
Expand Down
16 changes: 8 additions & 8 deletions public/static/docs/command-reference/diff.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,10 +17,10 @@ positional arguments:

## Description

Given two commit SHA hashes, branch or tag names, etc.
Given two commit hashes, branch or tag names, etc.
([references](https://git-scm.com/docs/revisions)) `a_ref` and `b_ref`, this
command shows a comparative summary of basic statistics: how many files were
deleted/changed, and the file size differences.
command shows a comparative summary of basic statistics related to files tracked
by DVC: how many files were deleted/changed, and the file size differences.

> Note that `dvc diff` does not show the line-to-line comparisons like
> `git diff` or [GNU `diff`](https://www.gnu.org/software/diffutils/) can. This
Expand Down Expand Up @@ -78,12 +78,12 @@ Preparing to download data from 'https://remote.dvc.org/get-started'

## Example: Previous commit in the same branch

The minimal `dvc diff` command only includes the "from" reference (`a_ref`) from
which to calculate the difference. The "until" reference (`b_ref`) defaults to
`HEAD` (current Git commit).
The minimal `dvc diff`, run without arguments, defaults to comparing DVC-tracked
files between `HEAD` (current Git commit) and the current <abbr>workspace</abbr>
(uncommitted changes, if any).

To see the difference with the very previous commit of the project, we can use
`HEAD^` as `a_ref`:
To see the difference between the very previous commit of the project and the
workspace, we can use `HEAD^` as `a_ref`:

```dvc
$ dvc diff HEAD^
Expand Down
19 changes: 9 additions & 10 deletions public/static/docs/command-reference/fetch.md
Original file line number Diff line number Diff line change
Expand Up @@ -64,11 +64,10 @@ for more information on how to configure different remote storage providers.
`dvc fetch`, `dvc pull`, and `dvc push` are related in that these 3 commands
perform data synchronization among local and remote storage. The specific way in
which the set of files to push/fetch/pull is determined begins with calculating
the checksums of the files in question, when these are
[added](/doc/get-started/add-files) to DVC. File checksums are then stored in
the corresponding DVC-files (usually saved in a Git branch). Only the checksums
specified in DVC-files currently in the project are considered by `dvc fetch`
(unless the `-a` or `-T` options are used).
file hashes when these are [added](/doc/get-started/add-files) to DVC. File
hashes are stored in the corresponding DVC-files (typically versioned with Git).
Only the hashes specified in DVC-files currently in the workspace are considered
by `dvc fetch` (unless the `-a` or `-T` options are used).
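
To make the idea of "hashes specified in DVC-files currently in the workspace" concrete, here is a rough Python sketch that collects the `md5` values listed under `outs` in every `*.dvc` file below a directory. It assumes the simple block-style YAML layout shown in the DVC-file examples (a top-level `outs:` key followed by `- md5: ...` entries) and uses a line scan rather than a YAML parser. The function names are hypothetical, and this is not how `dvc fetch` is implemented.

```python
import re
from pathlib import Path

# Matches an "md5: <32 hex chars>" field anywhere in a line.
MD5_FIELD = re.compile(r"\bmd5:\s*([0-9a-f]{32})")


def outs_hashes(dvc_file_text):
    """Return md5 values that appear under the top-level `outs` key."""
    hashes = []
    section = None
    for line in dvc_file_text.splitlines():
        # A non-indented line that is not a list item starts a new top-level key.
        if line and not line[0].isspace() and not line.startswith("-"):
            section = line.split(":", 1)[0]
            continue
        match = MD5_FIELD.search(line)
        if section == "outs" and match:
            hashes.append(match.group(1))
    return hashes


def workspace_hashes(workspace="."):
    """Collect output hashes from all *.dvc files under `workspace`."""
    found = set()
    for dvc_file in Path(workspace).rglob("*.dvc"):
        found.update(outs_hashes(dvc_file.read_text(encoding="utf-8")))
    return found
```

Only the files present in the given directory tree are scanned, which mirrors the default behavior described above; covering other branches or tags (the `-a` and `-T` options) would require reading DVC-files from Git history, which this sketch does not attempt.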

## Options

Expand Down Expand Up @@ -103,7 +102,7 @@ specified in DVC-files currently in the project are considered by `dvc fetch`
- `-T`, `--all-tags` - fetch cache for all Git tags. Similar to `-a` above. Note
that both options can be combined, for example using the `-aT` flag.

- `--show-checksums` - show checksums instead of file names when printing the
- `--show-checksums` - show file hashes instead of file names when printing the
download progress.

* `-h`, `--help` - prints the usage/help message and exits.
Expand Down Expand Up @@ -194,8 +193,8 @@ Note that the `.dvc/cache` directory was created and populated.
> for more info.

As seen above, used without arguments, `dvc fetch` downloads all assets needed
by all DVC-files in the current branch, including for directories. The checksums
`3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4`
by all DVC-files in the current branch, including for directories. The hash
values `3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4`
correspond to the `model.pkl` file and `data/features/` directory, respectively.

Let's now link files from the cache to the workspace with:
Expand Down Expand Up @@ -232,8 +231,8 @@ $ tree .dvc/cache
> Note that `prepare.dvc` is the first stage in our example's pipeline.

Cache entries for the necessary directories, as well as the actual
`data/prepared/test.tsv` and `data/prepared/train.tsv` files were download,
checksums shown above.
`data/prepared/test.tsv` and `data/prepared/train.tsv` files were downloaded.
Their hash values are shown above.

## Example: With dependencies

Expand Down
2 changes: 1 addition & 1 deletion public/static/docs/command-reference/gc.md
Original file line number Diff line number Diff line change
Expand Up @@ -79,7 +79,7 @@ $ du -sh .dvc/cache/
```

When you run `dvc gc` it removes all objects from cache that are not referenced
in the <abbr>workspace</abbr> (by collecting hash sums from the DVC-files):
in the <abbr>workspace</abbr> (by collecting hash values from the DVC-files):

```dvc
$ dvc gc
Expand Down
4 changes: 2 additions & 2 deletions public/static/docs/command-reference/get.md
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,7 @@ name.
an existing directory is specified, then the output will be placed inside of
it.

- `--rev` - commit SHA hash, branch or tag name, etc. (any
- `--rev` - commit hash, branch or tag name, etc. (any
[Git revision](https://git-scm.com/docs/revisions)) of the repository to
download the file or directory from. The latest commit in `master` (tip of the
default branch) is used by default when this option is not specified.
Expand Down Expand Up @@ -134,7 +134,7 @@ https://remote.dvc.org/get-started/66/2eb7f64216d9c2c1088d0a5e2c6951

`remote.dvc.org/get-started` is an HTTP
[DVC remote](/doc/command-reference/remote), whereas
`662eb7f64216d9c2c1088d0a5e2c6951` is the file's checksum.
`662eb7f64216d9c2c1088d0a5e2c6951` is the file hash.
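
The relationship between the file hash and the remote URL can be illustrated with a short Python sketch. The helpers `remote_url` and `hash_from_url` are hypothetical names introduced here; the only thing taken from the text is the observed layout, where the first two characters of the hash form a directory and the rest form the file name.

```python
def remote_url(remote_base, file_hash):
    """Build <remote_base>/<first 2 chars>/<remaining chars> from a hash."""
    return f"{remote_base.rstrip('/')}/{file_hash[:2]}/{file_hash[2:]}"


def hash_from_url(url):
    """Recover the hash by joining the last two path segments of the URL."""
    prefix, rest = url.rstrip("/").rsplit("/", 2)[-2:]
    return prefix + rest
```

With the values from this example, `remote_url("https://remote.dvc.org/get-started", "662eb7f64216d9c2c1088d0a5e2c6951")` reproduces the URL shown above.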

## Example: Compare different versions of data or model

Expand Down
6 changes: 3 additions & 3 deletions public/static/docs/command-reference/import-url.md
Original file line number Diff line number Diff line change
Expand Up @@ -241,7 +241,7 @@ outs:
The DVC-file is nearly the same as in the previous example. The difference is
that the dependency (`deps`) now references the local file in the data store
directory we created previously. (Its `path` has the URL for the data store.)
And instead of an `etag` we have an `md5` checksum. We did this so it's easy to
And instead of an `etag` we have an `md5` hash value. We did this so it's easy to
edit the data file.

Let's now manually reproduce a
Expand Down Expand Up @@ -306,8 +306,8 @@ Data and pipelines are up to date.

In the data store directory, edit `data.xml`. It doesn't matter what you change,
as long as it remains a valid XML file, because any change will result in a
different dependency file checksum (`md5`) in the import stage DVC-file. Once we
do so, we can run `dvc update` to make sure the import stage is up to date:
different dependency file hash (`md5`) in the import stage DVC-file. Once we do
so, we can run `dvc update` to make sure the import stage is up to date:

```dvc
$ dvc update data.xml.dvc
Expand Down
9 changes: 4 additions & 5 deletions public/static/docs/command-reference/import.md
Original file line number Diff line number Diff line change
Expand Up @@ -73,7 +73,7 @@ data artifact from the source repo.
an existing directory is specified, then the output will be placed inside of
it.

- `--rev` - commit SHA hash, branch or tag name, etc. (any
- `--rev` - commit hash, branch or tag name, etc. (any
[Git revision](https://git-scm.com/docs/revisions)) of the repository to
download the file or directory from. The latest commit in `master` (tip of the
default branch) is used by default when this option is not specified.
Expand Down Expand Up @@ -159,10 +159,9 @@ deps:
If `rev` is a Git branch or tag (where the commit it points to changes), the
data source may have updates at a later time. To bring it up to date if so (and
update `rev_lock` in the DVC-file), simply use `dvc update <stage>.dvc`. If
`rev` is a specific commit SHA hash (does not change), `dvc update` will never
have an effect on the import stage. You may **re-import** a different commit
instead, by using `dvc import` again with a different (or without) `--rev`. For
example:
`rev` is a specific commit hash (does not change), `dvc update` will never have
an effect on the import stage. You may **re-import** a different commit instead,
by using `dvc import` again with a different (or without) `--rev`. For example:

```dvc
$ dvc import --rev master \
Expand Down
19 changes: 10 additions & 9 deletions public/static/docs/command-reference/install.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,11 +21,11 @@ etc.) doesn't have DVC initialized (no `.dvc/` directory present).

Namely:

**Checkout**: For any commit SHA hash, branch or tag, `git checkout` retrieves
the [DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version.
The project's DVC-files in turn refer to data stored in <abbr>cache</abbr>, but
not necessarily in the <abbr>workspace</abbr>. Normally, it would be necessary
to run `dvc checkout` to synchronize workspace and DVC-files.
**Checkout**: For any commit hash, branch or tag, `git checkout` retrieves the
[DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. The
project's DVC-files in turn refer to data stored in <abbr>cache</abbr>, but not
necessarily in the <abbr>workspace</abbr>. Normally, it would be necessary to
run `dvc checkout` to synchronize workspace and DVC-files.

This hook automates running `dvc checkout`.

Expand Down Expand Up @@ -174,10 +174,11 @@ running `git checkout master`.
We also see that the first `dvc status` tells us about differences between the
project's <abbr>cache</abbr> and the data files currently in the workspace. Git
changed the DVC-files in the workspace, which changed references to data files.
What `dvc status` did is inform us the data files in the workspace no longer
matched the checksums in the [DVC-files](/doc/user-guide/dvc-file-format).
Running `dvc checkout` then checks out the corresponding data files, and a
second `dvc status` now tells us the data files match the DVC-files.
`dvc status` first informed us that the data files in the workspace no longer
matched the hash values in the corresponding
[DVC-files](/doc/user-guide/dvc-file-format). Running `dvc checkout` then brings
them up to date, and a second `dvc status` tells us that the data files now do
match the DVC-files.

```dvc
$ git checkout master
Expand Down
5 changes: 2 additions & 3 deletions public/static/docs/command-reference/remote/add.md
Original file line number Diff line number Diff line change
Expand Up @@ -213,9 +213,8 @@ $ dvc remote modify myremote gdrive_client_secret <client secret>

Note that GDrive remotes are not "trusted" by default. This means that the
[`verify`](/doc/command-reference/remote/modify#available-settings-for-all-remotes)
option is enabled on this type of storage, so DVC recalculates the checksums of
files upon download (e.g. `dvc pull`), to make sure that these haven't been
modified.
option is enabled on this type of storage, so DVC recalculates the file hashes
upon download (e.g. `dvc pull`), to make sure that these haven't been modified.
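
Conceptually, this kind of verification amounts to recomputing the hash of the downloaded file and comparing it with the expected value. The Python sketch below shows the idea only; `verify_download` and `HashMismatch` are hypothetical names, and DVC's own verification logic is not reproduced here.

```python
import hashlib


class HashMismatch(Exception):
    """Raised when a downloaded file does not match its expected hash."""


def verify_download(path, expected_md5, chunk_size=1024 * 1024):
    """Recompute the MD5 of `path` and compare it with `expected_md5`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_md5:
        raise HashMismatch(f"{path}: expected {expected_md5}, got {actual}")
    return actual
```

Raising an exception rather than returning `False` is a design choice made for this sketch, so that a modified file cannot be silently accepted by a caller that forgets to check the result.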

> Please note our [Privacy Policy (Google APIs)](/doc/user-guide/privacy).

Expand Down