cmd: rewrite push/pull et al. #1602

Merged
merged 14 commits into from
Aug 18, 2020
2 changes: 1 addition & 1 deletion content/docs/command-reference/checkout.md
@@ -23,7 +23,7 @@ corresponding versions of the DVC-tracked files and directories from the

The `targets` given to this command (if any) limit what to checkout. It accepts
paths to tracked files or directories (including paths inside tracked
directories), `.dvc` files, or stage names (found in `dvc.yaml`).
directories), `.dvc` files, and stage names (found in `dvc.yaml`).

The execution of `dvc checkout` does the following:

2 changes: 1 addition & 1 deletion content/docs/command-reference/fetch.md
@@ -29,7 +29,7 @@ for multiple Git commits.

The `targets` given to this command (if any) limit what to fetch. It accepts
paths to tracked files or directories (including paths inside tracked
directories), `.dvc` files, or stage names (found in `dvc.yaml`).
directories), `.dvc` files, and stage names (found in `dvc.yaml`).

Fetching is performed automatically by `dvc pull` (when the data is not already
in the <abbr>cache</abbr>), along with `dvc checkout`:
48 changes: 25 additions & 23 deletions content/docs/command-reference/pull.md
@@ -18,40 +18,42 @@ positional arguments:

## Description

The `dvc pull` and `dvc push` commands are the means for uploading and
downloading data to and from remote storage. These commands are analogous to
`git pull` and `git push`, respectively.
The `dvc push` and `dvc pull` commands are the means for uploading and
downloading data to and from remote storage (S3, SSH, GCS, etc.). These commands
are similar to `git push` and `git pull`, respectively.
[Data sharing](/doc/use-cases/sharing-data-and-model-files) across environments
and preserving data versions (input datasets, intermediate results, models,
[metrics](/doc/command-reference/metrics), etc) remotely (S3, SSH, GCS, etc.)
are the most common use cases for these commands.
[metrics](/doc/command-reference/metrics), etc.) remotely are the most common
use cases for these commands.

The `dvc pull` command allows one to retrieve data from remote storage.
`dvc pull` has the same effect as running `dvc fetch` and `dvc checkout`
immediately after.
`dvc pull` downloads data from [remote storage](/doc/command-reference/remote)
and places it in the <abbr>workspace</abbr>. It has the same effect as running
`dvc fetch` and `dvc checkout`.

The default remote is used (see `dvc config core.remote`) unless the `--remote`
> Note that pulling data does not change any `dvc.yaml` or `.dvc` files, nor
> does it save any changes to the code, `dvc.lock`, or `.dvc` files (that should
Member

not sure why would I expect it to save changes in the first place?

Contributor Author

True. I copied this from pull without editing accordingly. Updated.

I've also applied similar notes to fetch and commit, but didn't push it in this PR for now. I'll make another small PR soon.

> be saved with `git commit` and `git push`).

The default remote is used (see `dvc remote default`) unless the `--remote`
option is used. See `dvc remote` for more information on how to configure a
remote.

With no arguments, just `dvc pull` or `dvc pull --remote <name>`, it downloads
only the files (or directories) missing from the workspace by checking all
`.dvc` files and stages (in `dvc.yaml` and `dvc.lock`) currently in the
<abbr>project</abbr>. It will not download files associated with earlier commits
in the <abbr>repository</abbr> (if using Git), nor will it download files that
have not changed.

The command `dvc status -c` can list files referenced in current stages (in
`dvc.yaml`) or `.dvc` files, but missing from the <abbr>cache</abbr>. It can be
used to see what files `dvc pull` would download.
Without arguments, it downloads all files and directories missing from the
project, found as <abbr>outputs</abbr> of the stages or `.dvc` files present in
the workspace (the `--all-branches` and `--all-tags` options enable using
multiple workspace versions).

The `targets` given to this command (if any) limit what to pull. It accepts
paths to tracked files or directories (including paths inside tracked
directories), `.dvc` files, or stage names (found in `dvc.yaml`).
directories), `.dvc` files, and stage names (found in `dvc.yaml`).

After the data is in the cache, `dvc pull` uses OS-specific mechanisms like
reflinks or hardlinks to put it in the workspace, trying to avoid copying. See
`dvc checkout` for more details.
After the data is in the <abbr>cache</abbr>, `dvc pull` uses OS-specific
mechanisms like reflinks or hardlinks to put it in the workspace, trying to
avoid copying. See `dvc checkout` for more details.

Note that the command `dvc status -c` can list files referenced in current
stages (in `dvc.yaml`) or `.dvc` files, but missing from the cache. It can be
used to see what files `dvc pull` would download.
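
The relationship between `dvc pull`, `dvc fetch`, and `dvc checkout` described in this hunk can be illustrated with a small conceptual model. This is only a sketch under stated assumptions, not DVC's implementation: the remote, cache, and workspace are plain dictionaries, and the function names `fetch`, `checkout`, and `pull` are hypothetical stand-ins for the commands.

```python
from typing import Dict, Set


def fetch(remote: Dict[str, bytes], cache: Dict[str, bytes], wanted: Set[str]) -> Set[str]:
    """Copy wanted objects that the cache lacks from the remote; return their hashes."""
    downloaded = set()
    for file_hash in wanted:
        if file_hash not in cache and file_hash in remote:
            cache[file_hash] = remote[file_hash]
            downloaded.add(file_hash)
    return downloaded


def checkout(cache: Dict[str, bytes], outputs: Dict[str, str], workspace: Dict[str, bytes]) -> None:
    """Place cached data at each output path (this model copies; DVC tries to link)."""
    for path, file_hash in outputs.items():
        if file_hash in cache:
            workspace[path] = cache[file_hash]


def pull(remote: Dict[str, bytes], cache: Dict[str, bytes],
         outputs: Dict[str, str], workspace: Dict[str, bytes]) -> Set[str]:
    """Model of pull: fetch what the cache is missing, then check out the outputs."""
    downloaded = fetch(remote, cache, set(outputs.values()))
    checkout(cache, outputs, workspace)
    return downloaded


# Hypothetical project state: one output is already cached, one is not.
remote = {"a1": b"raw data", "b2": b"model weights"}
cache = {"a1": b"raw data"}
outputs = {"data.csv": "a1", "model.pkl": "b2"}  # path -> hash, as recorded in metadata
workspace: Dict[str, bytes] = {}

downloaded = pull(remote, cache, outputs, workspace)
print(sorted(downloaded))
print(sorted(workspace))
```

In this model only the object missing from the cache is downloaded, while both outputs end up in the workspace, which mirrors the "fetch, then checkout" description.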

## Options

65 changes: 27 additions & 38 deletions content/docs/command-reference/push.md
@@ -17,56 +17,45 @@ positional arguments:

## Description

The `dvc pull` and `dvc push` commands are the means for uploading and
downloading data to and from remote storage. These commands are similar to
`git pull` and `git push`, respectively (with some key differences given the
nature of DVC, see details below).

The `dvc push` and `dvc pull` commands are the means for uploading and
downloading data to and from remote storage (S3, SSH, GCS, etc.). These commands
are similar to `git push` and `git pull`, respectively.
[Data sharing](/doc/use-cases/sharing-data-and-model-files) across environments,
and preserving data versions (input datasets, intermediate results, models,
[metrics](/doc/command-reference/metrics), etc.)
[remotely](/doc/command-reference/remote) are the two most common use cases for
these commands.

The `dvc push` command allows us to upload data to remote storage. It doesn't
save any changes to the code, `dvc.yaml`, or `.dvc` files (those should be saved
with `git commit` and `git push`).
[metrics](/doc/command-reference/metrics), etc.) remotely are the most common
use cases for these commands.

💡 For convenience, a Git hook is available to automate running `dvc push` after
`git push`. See `dvc install` for more details.
`dvc push` uploads data to [remote storage](/doc/command-reference/remote).

Under the hood a few actions are taken:
> Note that pushing data does not change any `dvc.yaml` or `.dvc` files, nor
> does it save any changes to the code, `dvc.lock`, or `.dvc` files (that should
> be saved with `git commit` and `git push`).

- The push command by default uses all stages (in `dvc.yaml` and `dvc.lock`) and
`.dvc` files in the <abbr>workspace</abbr>. The command options will either
limit or expand the set of stages or `.dvc` files to consult.
The default remote is used (see `dvc remote default`) unless the `--remote`
option is used. See `dvc remote` for more information on how to configure a
remote.
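
The remote selection rule in this paragraph (the `--remote` option wins, otherwise the default remote is used) can be sketched as a small function. This is an illustrative model only: `select_remote` and `NoRemoteError` are hypothetical names, and the error case is an assumption based on the removed sentence in this hunk stating that it is an error if no remote is specified.

```python
from typing import Optional


class NoRemoteError(Exception):
    """Raised in this model when neither --remote nor a default remote is available."""


def select_remote(cli_remote: Optional[str], default_remote: Optional[str]) -> str:
    """Return the remote name to use: the explicit option first, then the default."""
    if cli_remote:
        return cli_remote
    if default_remote:
        return default_remote
    raise NoRemoteError("no remote given with --remote and no default remote configured")


explicit = select_remote("backup", "origin")   # --remote overrides the default
fallback = select_remote(None, "origin")       # default remote is used
print(explicit, fallback)
```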

- For each <abbr>output</abbr> referenced in every selected stage or `.dvc`
file, DVC finds a corresponding file or directory in the <abbr>cache</abbr>.
DVC then checks whether it exists in the remote. From this, DVC gathers a list
of files missing from the remote storage.
Without arguments, it uploads all files and directories missing from remote
storage, found as <abbr>outputs</abbr> of the stages or `.dvc` files present in
the workspace (the `--all-branches` and `--all-tags` options enable using
multiple workspace versions).

- Upload the cache files missing from remote storage, if any, to the remote.
The `targets` given to this command (if any) limit what to push. It accepts
paths to tracked files or directories (including paths inside tracked
directories), `.dvc` files, and stage names (found in `dvc.yaml`).

The DVC `push` command always works with a remote storage, and it is an error if
none are specified on the command line nor in the configuration. The default
remote is used (see `dvc config core.remote`) unless the `--remote` option is
used. See `dvc remote` for more information on how to configure a remote.
💡 For convenience, a Git hook is available to automate running `dvc push` after
`git push`. See `dvc install` for more details.

With no arguments, just `dvc push` or `dvc push --remote REMOTE`, it uploads
only the files (or directories) that are new in the local repository to remote
storage. It will not upload files associated with earlier commits in the
<abbr>repository</abbr> (if using Git), nor will it upload files that have not
changed.
For all <abbr>outputs</abbr> referenced in each target, DVC finds the
corresponding files and directories in the <abbr>cache</abbr> (identified by
hash values saved in `dvc.lock` and `.dvc` files). DVC then gathers a list of
files missing from the remote storage, and uploads them.

The `dvc status -c` command can list files tracked by DVC that are new in the
cache (compared to the default remote.) It can be used to see what files
Note that the `dvc status -c` command can list files tracked by DVC that are new
in the cache (compared to the default remote). It can be used to see what files
`dvc push` would upload.
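
The upload logic described in this hunk (find the cached objects for the selected outputs, list those missing from the remote, upload them) amounts to a set difference. The following is a minimal sketch under the assumption that the cache and remote can be modeled as dictionaries keyed by hash; `missing_from_remote` and `push` are hypothetical names, not DVC functions.

```python
from typing import Dict, List


def missing_from_remote(outputs: Dict[str, str], cache: Dict[str, bytes],
                        remote: Dict[str, bytes]) -> List[str]:
    """List hashes of outputs that are in the cache but not in the remote."""
    return sorted({h for h in outputs.values() if h in cache and h not in remote})


def push(outputs: Dict[str, str], cache: Dict[str, bytes],
         remote: Dict[str, bytes]) -> List[str]:
    """Model of push: upload only what the remote lacks; return what was uploaded."""
    to_upload = missing_from_remote(outputs, cache, remote)
    for file_hash in to_upload:
        remote[file_hash] = cache[file_hash]
    return to_upload


# Hypothetical state: both outputs are cached, only one is already in the remote.
outputs = {"data.csv": "a1", "model.pkl": "b2"}  # path -> hash from metadata files
cache = {"a1": b"raw data", "b2": b"model weights"}
remote = {"a1": b"raw data"}

pending = missing_from_remote(outputs, cache, remote)  # what a status check would report
uploaded = push(outputs, cache, remote)
second_run = push(outputs, cache, remote)              # nothing left to upload
print(pending, uploaded, second_run)
```

Computing the pending list separately from the upload mirrors how a status check can preview what a push would send.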

The `targets` given to this command (if any) limit what to push. It accepts
paths to tracked files or directories (including paths inside tracked
directories), `.dvc` files, or stage names (found in `dvc.yaml`).

## Options

- `-a`, `--all-branches` - determines the files to upload by examining
10 changes: 6 additions & 4 deletions content/docs/command-reference/status.md
@@ -38,9 +38,9 @@ in `dvc.lock` for stages) against the actual data files or directories in the
workspace. The `--all-branches`, `--all-tags`, and `--all-commits` options
enable checking data for multiple Git commits.

The `targets` given to this command (if any) limit what to check. Paths to
tracked files or directories (including paths inside tracked directories),
`.dvc` files, or stage names (found in `dvc.yaml`) are accepted.
The `targets` given to this command (if any) limit what to check. It accepts
paths to tracked files or directories (including paths inside tracked
directories), `.dvc` files, and stage names (found in `dvc.yaml`).

If no differences are detected, `dvc status` prints
`Data and pipelines are up to date.` If differences are detected by
@@ -49,6 +49,8 @@ differences, the changes in <abbr>dependencies</abbr> and/or
<abbr>outputs</abbr> that differ are listed. For each item listed, either the
file name or hash is shown, along with a _state description_, as detailed below:

### Local workspace status

- _changed checksum_ means that the `.dvc` file hash has changed (e.g. someone
manually edited it).

@@ -77,7 +79,7 @@ file name or hash is shown, along with a _state description_, as detailed below:
original data source has changed). The imported data can be brought to its
latest version by using `dvc update`.

**For comparison against remote storage:**
### Comparison against remote storage

- _new_ means that the file/directory exists in the cache but not in remote
storage.