cmd: rewrite push/pull et al. #1602

Merged
merged 14 commits into from
Aug 18, 2020
Changes from 6 commits
34 changes: 17 additions & 17 deletions content/docs/command-reference/pull.md
@@ -19,31 +19,27 @@ positional arguments:
## Description

The `dvc pull` and `dvc push` commands are the means for uploading and
downloading data to and from remote storage. These commands are analogous to
`git pull` and `git push`, respectively.
downloading data to and from remote storage (S3, SSH, GCS, etc.). These commands
are similar to `git pull` and `git push`, respectively.

[Data sharing](/doc/use-cases/sharing-data-and-model-files) across environments
and preserving data versions (input datasets, intermediate results, models,
[metrics](/doc/command-reference/metrics), etc) remotely (S3, SSH, GCS, etc.)
are the most common use cases for these commands.
[metrics](/doc/command-reference/metrics), etc) remotely are the most common use
cases for these commands.

The `dvc pull` command allows one to retrieve data from remote storage.
`dvc pull` has the same effect as running `dvc fetch` and `dvc checkout`
immediately after.
The `dvc pull` command allows us to download data from
[remote storage](/doc/command-reference/remote) and place it in the
<abbr>workspace</abbr>. `dvc pull` has the same effect as running `dvc fetch`
and `dvc checkout`.

The default remote is used (see `dvc config core.remote`) unless the `--remote`
option is used. See `dvc remote` for more information on how to configure a
remote.

With no arguments, just `dvc pull` or `dvc pull --remote <name>`, it downloads
only the files (or directories) missing from the workspace by checking all
`.dvc` files and stages (in `dvc.yaml` and `dvc.lock`) currently in the
<abbr>project</abbr>. It will not download files associated with earlier commits
in the <abbr>repository</abbr> (if using Git), nor will it download files that
have not changed.

The command `dvc status -c` can list files referenced in current stages (in
`dvc.yaml`) or `.dvc` files, but missing from the <abbr>cache</abbr>. It can be
used to see what files `dvc pull` would download.
Without arguments, it downloads all files and directories missing from the
project, found as <abbr>outputs</abbr> in the stages (in `dvc.lock`) or `.dvc`
files present in the workspace (the `--all-branches` and `--all-tags` enable
using multiple workspace versions).

The `targets` given to this command (if any) limit what to pull. It accepts
paths to tracked files or directories (including paths inside tracked
@@ -53,6 +49,10 @@ After the data is in the cache, `dvc pull` uses OS-specific mechanisms like
reflinks or hardlinks to put it in the workspace, trying to avoid copying. See
`dvc checkout` for more details.

Note that the command `dvc status -c` can list files referenced in current
stages (in `dvc.yaml`) or `.dvc` files, but missing from the <abbr>cache</abbr>.
It can be used to see what files `dvc pull` would download.
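
For readers of this diff, the statement that `dvc pull` has the same effect as `dvc fetch` followed by `dvc checkout` can be illustrated with a toy model. This is only a sketch under stated assumptions, not DVC's implementation: the function names, the dictionaries standing in for the cache, remote, and workspace, and the hash strings are all hypothetical.

```python
# Toy model only: NOT DVC's code. All names and data shapes are hypothetical.
# outputs:   tracked path -> content hash (as recorded in DVC files)
# cache:     content hash -> bytes held in the local cache
# remote:    content hash -> bytes held in remote storage
# workspace: path -> bytes currently placed in the workspace


def fetch(outputs, cache, remote):
    """Download into the cache every referenced hash that is missing locally."""
    missing = {h for h in outputs.values() if h not in cache}
    downloadable = missing & set(remote)
    for h in downloadable:
        cache[h] = remote[h]
    return sorted(downloadable)


def checkout(outputs, cache, workspace):
    """Place cached content into the workspace at each tracked path."""
    for path, h in outputs.items():
        if h in cache:
            workspace[path] = cache[h]


def pull(outputs, cache, remote, workspace):
    """Model of pull: fetch first, then checkout."""
    fetched = fetch(outputs, cache, remote)
    checkout(outputs, cache, workspace)
    return fetched


outputs = {"data.csv": "a1", "model.pkl": "b2"}
cache = {"a1": b"rows"}                      # "b2" is missing from the cache
remote = {"a1": b"rows", "b2": b"weights"}
workspace = {}

print(pull(outputs, cache, remote, workspace))  # ['b2']
print(sorted(workspace))                        # ['data.csv', 'model.pkl']
```

In this model, `fetch` only touches the cache and `checkout` only touches the workspace, which mirrors why the two steps can also be run separately.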

## Options

- `-a`, `--all-branches` - determines the files to download by examining
65 changes: 29 additions & 36 deletions content/docs/command-reference/push.md
@@ -18,55 +18,48 @@ positional arguments:
## Description

The `dvc pull` and `dvc push` commands are the means for uploading and
downloading data to and from remote storage. These commands are similar to
`git pull` and `git push`, respectively (with some key differences given the
nature of DVC, see details below).
downloading data to and from remote storage (S3, SSH, GCS, etc.). These commands
are similar to `git pull` and `git push`, respectively.

[Data sharing](/doc/use-cases/sharing-data-and-model-files) across environments,
and preserving data versions (input datasets, intermediate results, models,
[metrics](/doc/command-reference/metrics), etc.)
[remotely](/doc/command-reference/remote) are the two most common use cases for
these commands.
[metrics](/doc/command-reference/metrics), etc.) remotely are the most common
use cases for these commands.

The `dvc push` command allows us to upload data to remote storage. It doesn't
save any changes to the code, `dvc.yaml`, or `.dvc` files (those should be saved
with `git commit` and `git push`).
The `dvc push` command allows us to upload data to
[remote storage](/doc/command-reference/remote). It doesn't save any changes to
the code, `dvc.yaml`, or `.dvc` files (that should be saved with `git commit`
Member

should include dvc.lock here ...

Contributor Author @jorgeorpinel, Aug 16, 2020

True. That whole sentence is kind of a note though, now that I think about it. I changed it a little in 4717b78. It's still not perfect but we'll have to review this again soon, when addressing #1663.

and `git push`).

💡 For convenience, a Git hook is available to automate running `dvc push` after
`git push`. See `dvc install` for more details.

Under the hood a few actions are taken:
The default remote is used (see `dvc config core.remote`) unless the `--remote`
option is used. See `dvc remote` for more information on how to configure a
remote.

- The push command by default uses all stages (in `dvc.yaml` and `dvc.lock`) and
`.dvc` files in the <abbr>workspace</abbr>. The command options will either
limit or expand the set of stages or `.dvc` files to consult.
Without arguments, it uploads all files and directories missing from remote
storage, found as <abbr>outputs</abbr> in the stages (in `dvc.lock`) or `.dvc`
Member

dvc.yaml?

Contributor Author

Hmmm yeah... Kinda. dvc.lock is where you find the output file names — that's what this meant. But probably this paragraph should be simplified and not even mention either one. See 246f446.

files present in the workspace (the `--all-branches` and `--all-tags` enable
using multiple workspace versions).

- For each <abbr>output</abbr> referenced in every selected stage or `.dvc`
file, DVC finds a corresponding file or directory in the <abbr>cache</abbr>.
DVC then checks whether it exists in the remote. From this, DVC gathers a list
of files missing from the remote storage.
The `targets` given to this command (if any) limit what to push. It accepts
paths to tracked files or directories (including paths inside tracked
directories), `.dvc` files, or stage names (found in `dvc.yaml`).

- Upload the cache files missing from remote storage, if any, to the remote.
💡 For convenience, a Git hook is available to automate running `dvc push` after
`git push`. See `dvc install` for more details.

The DVC `push` command always works with a remote storage, and it is an error if
none are specified on the command line nor in the configuration. The default
remote is used (see `dvc config core.remote`) unless the `--remote` option is
used. See `dvc remote` for more information on how to configure a remote.
Under the hood, a few actions are taken:

With no arguments, just `dvc push` or `dvc push --remote REMOTE`, it uploads
only the files (or directories) that are new in the local repository to remote
storage. It will not upload files associated with earlier commits in the
<abbr>repository</abbr> (if using Git), nor will it upload files that have not
changed.
- The push command checks the appropriate `dvc.lock` and `.dvc` files in the
Member

dvc.yaml?

Member

It's a command-reference, but I don't think this information should be here (perhaps on Dvc files?).

Also, @jorgeorpinel, dvc.yaml is a discovery file which dvc reads to find outputs/dependencies, etc. and dvc.lock is a file to correlate those outputs/deps with the exact version through the checksums. dvc.lock is never used if there's no corresponding deps/outs/stages in dvc.yaml. So, either both dvc.yaml and dvc.lock are read together or we could just say dvc.yaml is read (and consider dvc.lock as an implementation detail).

Member

> It's a command-reference, but I don't think this information should be here

Could you please clarify exactly which information you mean?

> So, either both dvc.yaml and dvc.lock are read together or we could just say dvc.yaml is read (and consider dvc.lock as an implementation detail) ..

We come back to a single term, "DVC files", without specifying details, plus a tooltip mentioning that DVC files == dvc.yaml + dvc.lock + .dvc. All of those files play their role here. We can expand later and be more explicit where it is needed.

Contributor Author @jorgeorpinel, Aug 15, 2020

dvc.yaml is mentioned just before: "targets ... limit what to pull. It accepts ... and stage names (found in dvc.yaml)". Then, when we say "push checks the appropriate dvc.lock files", by "appropriate" we mean corresponding to what was found in dvc.yaml. But let me see if I can clarify this further... ⌛

> just say dvc.yaml is read (and consider dvc.lock as an implementation detail

This is not a bad idea... We do go over implementation details in cmd refs though; in fact they're probably the lowest-level docs we have. But we could definitely hide some of those details in expandable sections.
I'm just not sure dvc.lock should be considered a low-level detail; it's pretty visible to the user.

> we come back to a single term "DVC files" w/o specifying details with some tooltip that mentions that DVC files

Yes, let's try to merge this one as best we can though, and follow up on that in #1663?

See also iterative/dvc/issues/4393.

Contributor Author @jorgeorpinel, Aug 15, 2020

> let me see if I can clarify this further... ⌛

I turned this whole bullet list into a simpler paragraph in dc87fe5. PTAL

<abbr>workspace</abbr>.
- For each <abbr>output</abbr> referenced in each stage or `.dvc` file, DVC
finds a corresponding file or directory in the <abbr>cache</abbr>. DVC then
gathers a list of files missing from the remote storage.
- The cached files missing from remote storage, if any, are uploaded.

The `dvc status -c` command can list files tracked by DVC that are new in the
cache (compared to the default remote.) It can be used to see what files
Note that the `dvc status -c` command can list files tracked by DVC that are new
in the cache (compared to the default remote.) It can be used to see what files
`dvc push` would upload.
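
For readers of this diff, the "under the hood" steps listed above amount to a set comparison between what the DVC files reference, what the cache holds, and what the remote already has. The following is a minimal sketch under stated assumptions, not DVC's implementation: `plan_push`, its parameters, and the hash strings are hypothetical.

```python
# Toy model only: NOT DVC's code. All names and data shapes are hypothetical.


def plan_push(outputs, cache, remote):
    """Return the sorted hashes that a push would upload in this model.

    outputs: tracked path -> content hash (as recorded in DVC files)
    cache:   set of hashes present in the local cache
    remote:  set of hashes already present in remote storage
    """
    # Step 1: collect the hashes referenced by the selected stages / .dvc files.
    referenced = set(outputs.values())
    # Step 2: keep only those that actually exist in the local cache.
    cached = referenced & cache
    # Step 3: anything cached but absent from the remote needs uploading.
    return sorted(cached - remote)


outputs = {"data/raw.csv": "a1", "model.pkl": "b2", "metrics.json": "c3"}
cache = {"a1", "b2"}   # "c3" was never cached locally
remote = {"a1"}        # only "a1" has been pushed before

print(plan_push(outputs, cache, remote))  # ['b2']
```

The list returned here plays the role of what `dvc status -c` would report as new in the cache relative to the remote.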

The `targets` given to this command (if any) limit what to push. It accepts
paths to tracked files or directories (including paths inside tracked
directories), `.dvc` files, or stage names (found in `dvc.yaml`).

## Options

- `-a`, `--all-branches` - determines the files to upload by examining
4 changes: 3 additions & 1 deletion content/docs/command-reference/status.md
@@ -49,6 +49,8 @@ differences, the changes in <abbr>dependencies</abbr> and/or
<abbr>outputs</abbr> that differ are listed. For each item listed, either the
file name or hash is shown, along with a _state description_, as detailed below:

### Local workspace status

- _changed checksum_ means that the `.dvc` file hash has changed (e.g. someone
manually edited it).

@@ -77,7 +79,7 @@ file name or hash is shown, along with a _state description_, as detailed below:
original data source has changed). The imported data can be brought to its
latest version by using `dvc update`.

**For comparison against remote storage:**
### Comparison against remote storage

- _new_ means that the file/directory exists in the cache but not in remote
storage.