ref: document import-url cloud versioning changes #4142

Merged
merged 7 commits into from
Dec 22, 2022
20 changes: 20 additions & 0 deletions content/docs/command-reference/import-url.md
@@ -108,6 +108,20 @@ DVC supports several types of external locations (protocols):
[ETag](https://en.wikipedia.org/wiki/HTTP_ETag#Strong_and_weak_validation) is
necessary to track if the specified URL changed.

DVC also supports capturing cloud versioning information when importing data
from certain cloud storage providers. When the `--version-aware` option is
provided or when the `url` argument includes a supported cloud versioning ID,
DVC will import the specified version of the given data. When using versioned
storage, DVC will always [pull](/doc/command-reference/pull) the versioned data
from its original source location. Versioned data will also not be
[pushed](/doc/command-reference/push) to remote storage.

| Type | Description | Versioned `url` format example |
| ------- | ---------------------------- | ------------------------------------------------------ |
| `s3` | Amazon S3 | `s3://bucket/data?versionId=L4kqtJlcpXroDTDmpUMLUo` |
| `azure` | Microsoft Azure Blob Storage | `azure://container/data?versionid=YYYY-MM-DDThh:mm:ss` |
| `gs` | Google Cloud Storage | `gs://bucket/data#1360887697105000` |
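
As a rough illustration of the three URL formats in the table above, the following Python sketch separates a versioned URL into its base location and version ID. It is not DVC's implementation: `split_version` and `VERSION_QUERY_KEYS` are hypothetical names, and the only assumption is the format shown in the table (a query parameter for `s3` and `azure`, a `#` fragment for `gs`).

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical lookup, for illustration only: the query parameter that
# carries the version ID for each scheme listed in the table.
VERSION_QUERY_KEYS = {"s3": "versionId", "azure": "versionid"}


def split_version(url):
    """Return (url_without_version, version_id), with None if no ID is found."""
    parts = urlsplit(url)
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    if parts.scheme == "gs":
        # Google Cloud Storage: the generation number follows '#'.
        return base, parts.fragment or None
    key = VERSION_QUERY_KEYS.get(parts.scheme)
    if key is None:
        # Not one of the versioned schemes in the table: leave the URL as-is.
        return url, None
    values = parse_qs(parts.query).get(key)
    return base, values[0] if values else None
```

Called on `s3://bucket/data?versionId=L4kqtJlcpXroDTDmpUMLUo`, the function returns the base `s3://bucket/data` together with the version ID; a URL without a version ID yields `None` as the second element.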

Another way to understand the `dvc import-url` command is as a shortcut for
generating a pipeline [stage](/doc/command-reference/run) with an external
dependency.
@@ -179,6 +193,12 @@ produces a regular stage in `dvc.yaml`.

- `-h`, `--help` - prints the usage/help message, and exits.

- `--version-aware` - capture cloud versioning information when importing the
file. By default, DVC will automatically capture cloud versioning information
if the URL contains a cloud versioning ID. When `--version-aware` is provided
along with a URL that does not contain a cloud versioning ID, DVC will capture
the latest version of the file.
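
The rule in the `--version-aware` option text can be summarized as a small decision function. This is only a sketch of the behavior as described, not DVC code: `version_to_capture` and its parameters are hypothetical names, and `latest_version_id` stands in for whatever the storage provider reports as the current version.

```python
def version_to_capture(url_version_id, version_aware, latest_version_id):
    """Pick the object version to record for an import, or None for no versioning."""
    if url_version_id is not None:
        # A version ID in the URL is captured by default, with or without the flag.
        return url_version_id
    if version_aware:
        # Flag given but no ID in the URL: capture the latest version.
        return latest_version_id
    # Neither a version ID nor the flag: no cloud versioning information.
    return None
```
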

Collaborator

Let's also explain that dvc will pull that version from the source location even if it's overwritten, and will not push another copy of it to the remote.

cc @jorgeorpinel Is there somewhere in the data management user guide we want to add this info also?

Contributor @jorgeorpinel, Dec 3, 2022

We'll def. need UG updates to go over cloud versioning (feel free to make a separate docs issue) -- can't explain everything in an option text. For now I'd focus on what the flag does, and put some explanations in the Description (which in this case is already super long and should be rewritten/ moved to UG eventually).


- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.

5 changes: 5 additions & 0 deletions content/docs/command-reference/update.md
@@ -48,6 +48,11 @@ $ dvc update --rev master
> Note that this changes the `rev` field in the import stage, fixing it to the
> revision.

For stages created with `dvc import-url` and a
[cloud-versioned URL](/doc/command-reference/import-url#--version-aware),
`--rev` can be used to specify an object version ID to use. By default, the
import will be updated to the latest version from cloud storage.

- `-R`, `--recursive` - determines the files to update by searching each target
directory and its subdirectories for import `.dvc` files to inspect. If there
are no directories among the targets, this option has no effect.
1 change: 1 addition & 0 deletions content/docs/user-guide/project-structure/dvc-files.md
@@ -76,6 +76,7 @@ The following subfields may be present under `outs` entries:
| `type` | User-assigned type of the data. |
| `labels` | User-assigned labels to add to the data. |
| `meta` | Custom metadata about the data. |
| `push` | Whether or not this file or directory, when previously <abbr>cached</abbr>, is uploaded to remote storage by `dvc push` (`true` by default). |
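
To make the `true` default of the `push` field concrete, here is a minimal Python sketch that selects which output entries would be uploaded. It is an illustration under stated assumptions rather than DVC's logic: `outs_to_push` is a hypothetical helper, and the entries are plain dictionaries standing in for parsed `outs` entries.

```python
def outs_to_push(outs):
    """Return the paths of entries that are not excluded from pushing."""
    # A missing `push` field is treated as true, matching the documented default.
    return [out["path"] for out in outs if out.get("push", True)]


# Hypothetical entries, for illustration only.
example_outs = [
    {"path": "data.xml"},                       # no `push` field: pushed
    {"path": "model.pkl", "push": True},        # explicitly pushed
    {"path": "imported-data", "push": False},   # skipped by push
]
selected = outs_to_push(example_outs)
```

Only the entry with `push: false` is left out of `selected`; the other two are kept.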

## Dependency entries

1 change: 1 addition & 0 deletions content/docs/user-guide/project-structure/dvcyaml-files.md
@@ -265,6 +265,7 @@ These include a subset of the fields in `.dvc` file
| `persist` | Whether the output file/dir should remain in place during `dvc repro` (`false` by default: outputs are deleted when `dvc repro` starts) |
| `checkpoint` | (Optional) Set to `true` to let DVC know that this output is associated with [checkpoint experiments](/doc/user-guide/experiment-management/checkpoints). These outputs are reverted to their last cached version at `dvc exp run` and also `persist` during the stage execution. |
| `desc` | (Optional) User description for this output. This doesn't affect any DVC operations. |
| `push` | Whether or not this file or directory, when previously <abbr>cached</abbr>, is uploaded to remote storage by `dvc push` (`true` by default). |

Contributor @jorgeorpinel, Dec 3, 2022

Echoing iterative/dvc#8581 (comment):

Should we plan to recommend this a lot in Data Pipeline docs? Specifically for intermediate pipeline outputs. Assuming the happy path out there is to push only raw data and likely final ML model files (everything else may be best to dvc repro when needed).

If we don't at least emphasize the possibility, users may realize too late they have pushed a bunch of intermediate output versions and they are pretty difficult to clean up with dvc gc (support example).

Contributor Author @pmrowla, Dec 6, 2022

I'm not sure that not pushing is the right default behavior, even for intermediate outputs. If the user wants to take advantage of run-cache to not re-run stages that have already been reproduced, they still need to push/pull intermediate outs.

Collaborator

@jorgeorpinel Thinking about it some more, I like the suggestion and think it makes sense as a possible product direction to make it easier to get started with pipelines, so let's brainstorm more on it.


<admon type="warn">
