more 1.x updates... (import) #2094

Merged
merged 3 commits into from
Jan 14, 2021
Changes from all commits
75 changes: 38 additions & 37 deletions content/docs/command-reference/import-url.md
@@ -1,8 +1,8 @@
# import-url

Download a file or directory from a supported URL (for example `s3://`,
`ssh://`, and other protocols) into the <abbr>workspace</abbr>, and track
changes in the remote data source. Creates a `.dvc` file.
`ssh://`, and other protocols) into the <abbr>workspace</abbr>, and track it (an
import `.dvc` file is created).
Comment on lines 1 to +5
Contributor Author:
For import-url -h


> See `dvc import` to download and track data/model files or directories from
> other <abbr>DVC repositories</abbr> (e.g. hosted on GitHub).
@@ -21,39 +21,45 @@ positional arguments:

## Description

In some cases it's convenient to add a data file or directory from a remote
In some cases it's convenient to add a data file or directory from an external
location into the workspace, such that it can be updated later, if/when the
external data source changes. Example scenarios:

- A remote system may produce occasional data files that are used in other
projects.
- A batch process running regularly updates a data file to import.
- A shared dataset on a remote storage that is managed and updated outside DVC.
- A shared dataset on cloud storage that is managed and updated outside DVC.

> Note that `dvc get-url` corresponds to the first step this command performs
> (just download the file or directory).
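
For illustration, a minimal sketch of that relationship (the URL and file name
are hypothetical):

```dvc
$ dvc get-url https://example.com/data.txt      # download only, nothing is tracked
$ dvc import-url https://example.com/data.txt   # download and track; creates data.txt.dvc
```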

The `dvc import-url` command helps the user create such an external data
dependency without having to manually copying files from the supported remote
locations (listed below), which may require installing a different tool for each
type.
`dvc import-url` helps you create such an external data dependency, without
having to manually copy files from the supported locations (listed below), which
may require installing a different tool for each type.

The `url` argument specifies the external location of the data to be imported,
while `out` can be used to specify the directory and/or file name desired for
the downloaded data. If an existing directory is specified, the file or
directory will be placed inside.
The `url` argument specifies the external location of the data to be imported.
The imported data is <abbr>cached</abbr>, and linked (or copied) to the current
working directory with its original file name e.g. `data.txt` (or to a location
provided with `out`).

An _import `.dvc` file_ is created in the same location e.g. `data.txt.dvc` –
similar to using `dvc add` after downloading the data. This makes it possible to
update the import later, if the data source has changed (see `dvc update`).

> Note that the imported data can be [pushed](/doc/command-reference/push) to
> remote storage normally.
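
For instance, assuming a default remote is already configured, something like
this should upload the cached copy of the import along with everything else:

```dvc
$ dvc push   # uploads cached data, including the imported file
```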

`.dvc` files support references to data in an external location, see
[External Dependencies](/doc/user-guide/external-dependencies). In such an
import `.dvc` file, the `deps` field stores the remote URL, and the `outs` field
contains the corresponding local path in the <abbr>workspace</abbr>. It records
enough metadata about the imported data to enable DVC efficiently determining
whether the local copy is out of date.
import `.dvc` file, the `deps` field stores the external URL, and the `outs`
field contains the corresponding local path in the <abbr>workspace</abbr>. It
records enough metadata about the imported data for DVC to efficiently
determine whether the local copy is out of date.
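
As a simplified sketch, an import-url `.dvc` file might look roughly like this
(the hash and ETag values below are placeholders):

```yaml
deps:
- etag: '"3f1030-5764a8c"'
  path: https://example.com/data.txt
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  path: data.txt
```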

Note that `dvc repro` doesn't check or update import `.dvc` files; use
`dvc update` to bring the import up to date from the data source.
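
For example, to refresh the hypothetical `data.txt` import from above:

```dvc
$ dvc update data.txt.dvc
```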

DVC supports several types of (local or) remote locations (protocols):
DVC supports several types of external locations (protocols):

| Type | Description | `url` format example |
| --------- | ---------------------------- | --------------------------------------------- |
@@ -82,8 +88,7 @@ DVC supports several types of (local or) remote locations (protocols):

- In case of HTTP,
[ETag](https://en.wikipedia.org/wiki/HTTP_ETag#Strong_and_weak_validation) is
necessary to track if the specified remote file (URL) changed to download it
again.
necessary to track whether the file at the specified URL has changed.

- `remote://myremote/path/to/file` notation just means that a DVC
[remote](/doc/command-reference/remote) `myremote` is defined and when DVC is
@@ -110,12 +115,8 @@ $ dvc run -n download_data \
wget https://data.dvc.org/get-started/data.xml -O data.xml
```

`dvc import-url` generates an _import stage_ `.dvc` file and `dvc run` a regular
stage (in `dvc.yaml`).

⚠️ DVC won't push or pull imported data to/from
[remote storage](/doc/command-reference/remote), it will rely on it's original
source.
`dvc import-url` generates an _import `.dvc` file_ and `dvc run` a regular stage
(in `dvc.yaml`).
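
For comparison, the import-based equivalent of the `dvc run` example above
would be something like:

```dvc
$ dvc import-url https://data.dvc.org/get-started/data.xml data.xml
```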

## Options

@@ -163,7 +164,7 @@ $ git checkout 3-config-remote

</details>

## Example: Tracking a remote file
## Example: Tracking a file from the web

An advanced alternative to the intro of the
[Versioning Basics](/doc/tutorials/get-started/data-versioning) part of the _Get
@@ -195,28 +196,28 @@ Let's take a look at the changes to the `data.xml.dvc`:

The `etag` field in the `.dvc` file contains the
[ETag](https://en.wikipedia.org/wiki/HTTP_ETag) recorded from the HTTP request.
If the remote file changes, its ETag will be different. This metadata allows DVC
to determine whether it's necessary to download it again.
If the imported file changes online, its ETag will be different. This metadata
allows DVC to determine whether it's necessary to download it again.
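
The relevant `deps` entry looks roughly like this (the ETag value shown is a
placeholder):

```yaml
deps:
- etag: '"22a1a2931c8370d3aeedd7183606fd7f"'
  path: https://data.dvc.org/get-started/data.xml
```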

> See `.dvc` files for more details on the format above.

You may want to exit and remove the `example-get-started/` directory after
trying this example (especially if trying out the following one).

## Example: Detecting remote file changes
## Example: Detecting external file changes

What if that remote file is updated regularly? The project goals might include
regenerating some results based on the updated data source.
What if an imported file is updated regularly at its source? The project goals
might include regenerating some results based on the updated data source.
[Pipeline](/doc/command-reference/dag) reproduction can be triggered based on a
changed external dependency.

Let's use the [Get Started](/doc/tutorials/get-started) project again,
simulating an updated external data source. (Remember to prepare the
<abbr>workspace</abbr>, as explained in [Examples](#examples))

To illustrate this scenario, let's use a local file system directory (external
to the workspace) to simulate a remote data source location. (In real life, the
data file will probably be on a remote server.) Run these commands:
To illustrate this scenario, let's use a local file system directory external to
the workspace (in real life, the data file could be on a remote server instead).
Run these commands:

```dvc
$ mkdir /tmp/dvc-import-url-example
@@ -319,15 +320,15 @@ Data and pipelines are up to date.

In the data store directory, edit `data.xml`. It doesn't matter what you change,
as long as it remains a valid XML file, because any change will result in a
different dependency file hash (`md5`) in the import stage `.dvc` file. Once we
do so, we can run `dvc update` to make sure the import is up to date:
different dependency file hash (`md5`) in the import `.dvc` file. Once we do so,
we can run `dvc update` to make sure the import is up to date:

```dvc
$ dvc update data.xml.dvc
Importing '.../tmp/dvc-import-url-example/data.xml' -> 'data/data.xml'
```

DVC notices the "external" data source has changed, and updates the import stage
DVC notices the external data source has changed, and updates the `.dvc` file
(reproduces it). In this case it's also necessary to run `dvc repro` so that the
remaining pipeline results are also regenerated:

72 changes: 34 additions & 38 deletions content/docs/command-reference/import.md
@@ -1,9 +1,7 @@
# import

Download a file or directory tracked by DVC or by Git into the
<abbr>workspace</abbr>. It also creates a `.dvc` file with information about the
data source, which can later be used to [update](/doc/command-reference/update)
the import.
Download a file or directory tracked by another DVC or Git repository into the
<abbr>workspace</abbr>, and track it (an import `.dvc` file is created).
Comment on lines 1 to +4
Contributor Author:
For import -h


> See also our `dvc.api.open()` Python API function.

@@ -25,16 +23,25 @@ positional arguments:
Provides an easy way to reuse files or directories tracked in any <abbr>DVC
repository</abbr> (e.g. datasets, intermediate results, ML models) or Git
repository (e.g. code, small image/other files). `dvc import` downloads the
target file or directory (found at `path` in `url`) into the workspace and
tracks it in the project. This makes it possible to update the import later, if
it has changed in its data source (see `dvc update`).
target file or directory (found at `path` in `url`), and tracks it in the local
project. This makes it possible to update the import later, if the data source
has changed (see `dvc update`).

> Note that `dvc get` corresponds to the first step this command performs (just
> download the data).

> See `dvc list` for a way to browse repository contents to find files or
> directories to import.
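
For illustration, a minimal end-to-end sketch using the example repository that
appears later in this document (exact output paths may vary):

```dvc
$ dvc list https://github.com/iterative/example-get-started data
$ dvc import https://github.com/iterative/example-get-started data/data.xml
$ dvc update data.xml.dvc   # later, re-check the source and refresh if needed
```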

The imported data is <abbr>cached</abbr>, and linked (or copied) to the current
working directory with its original file name e.g. `data.txt` (or to a location
provided with `--out`). An _import `.dvc` file_ is created in the same location
e.g. `data.txt.dvc` – similar to using `dvc add` after downloading the data.

⚠️ DVC won't push or pull data imported from other DVC repos to/from
[remote storage](/doc/command-reference/remote). It will rely on its original
source.

The `url` argument specifies the address of the DVC or Git repository containing
the data source. Both HTTP and SSH protocols are supported (e.g.
`[user@]server:project.git`). `url` can also be a local file system path
@@ -46,33 +53,22 @@ tracked by either Git or DVC (including paths inside tracked directories). Note
that DVC-tracked targets must be found in a `dvc.yaml` or `.dvc` file of the
repo.

⚠️ DVC repos should have a default [DVC remote](/doc/command-reference/remote)
containing the target actual for this command to work. The only exception is for
local repos, where DVC will try to copy the data from its <abbr>cache</abbr>
first.
⚠️ Source DVC repos should have a default
[DVC remote](/doc/command-reference/remote) containing the target data for this
command to work. The only exception is for local repos, where DVC will try to
copy the data from its <abbr>cache</abbr> first.

> See `dvc import-url` to download and track data from other supported locations
> such as S3, SSH, HTTP, etc.

After running this command successfully, the imported data is placed in the
current working directory (unless `-o` is used) with its original file name e.g.
`data.txt`. An _import stage_ (`.dvc` file) is also created in the same
location, extending the name of the imported data e.g. `data.txt.dvc` – similar
to having used `dvc run` to generate the data as a stage <abbr>output</abbr>.

`.dvc` files support references to data in an external DVC repository (hosted on
a Git server). In such a `.dvc` file, the `deps` field specifies the remote
`url` and data `path`, and the `outs` field contains the corresponding local
path in the <abbr>workspace</abbr>. It records enough metadata about the
imported data to enable DVC efficiently determining whether the local copy is
out of date.

⚠️ DVC won't push or pull imported data to/from
[remote storage](/doc/command-reference/remote), it will rely on it's original
source.
a Git server). In such a `.dvc` file, the `deps` field specifies the `url` and
data `path`, and the `outs` field contains the corresponding local path in the
<abbr>workspace</abbr>. It records enough metadata about the imported data for
DVC to efficiently determine whether the local copy is out of date.
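
A simplified sketch of such a file (hash values are placeholders):

```yaml
deps:
- path: data/data.xml
  repo:
    url: https://github.com/iterative/example-get-started
    rev_lock: f31c1dba8b79f7546e9b07227f84fcbb00ff5659
outs:
- md5: a304afb96060aad90176268345e10355
  path: data.xml
```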

To actually [version the data](/doc/tutorials/get-started/data-versioning),
`git add` (and `git commit`) the import stage.
`git add` (and `git commit`) the import `.dvc` file.
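
For example (DVC also adds the imported file itself to `.gitignore` when the
import is created):

```dvc
$ git add data.xml.dvc .gitignore
$ git commit -m "Import data.xml"
```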

Note that `dvc repro` doesn't check or update import `.dvc` files (see
`dvc freeze`), use `dvc update` to bring the import up to date from the data
@@ -98,8 +94,8 @@ repo at `url`) are not supported.
download the file or directory from. The latest commit in `master` (tip of the
default branch) is used by default when this option is not specified.

> Note that this adds a `rev` field in the import stage that fixes it to the
> revision. This can impact the behavior of `dvc update` (see the
> Note that this adds a `rev` field in the import `.dvc` file that fixes it to
> the revision. This can impact the behavior of `dvc update` (see the
> [Importing and updating fixed revisions](#example-importing-and-updating-fixed-revisions)
> example below).
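
A hypothetical invocation pinning a tag (the repository URL, tag, and path are
placeholders):

```dvc
$ dvc import --rev v1.0 https://github.com/example/dataset-registry data/data.csv
```
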

Expand Down Expand Up @@ -140,8 +136,8 @@ Importing 'data/data.xml ([email protected]:iterative/example-get-started)'
```

In contrast with `dvc get`, this command doesn't just download the data file,
but it also creates an import stage (`.dvc` file) with a link to the data source
(as explained in the description above). (This `.dvc` file can later be used to
but it also creates an import `.dvc` file with a link to the data source (as
explained in the description above). (This `.dvc` file can later be used to
[update](/doc/command-reference/update) the import.) Check `data.xml.dvc`:

```yaml
@@ -176,8 +172,8 @@ Importing
-> 'cats-dogs'
```

When using this option, the import stage (`.dvc` file) will also have a `rev`
subfield under `repo`:
When using this option, the import `.dvc` file will also have a `rev` subfield
under `repo`:

```yaml
deps:
@@ -192,14 +188,14 @@ If `rev` is a Git branch or tag (where the underlying commit changes), the data
source may have updates at a later time. To bring it up to date if so (and
update `rev_lock` in the `.dvc` file), simply use `dvc update <stage>.dvc`. If
`rev` is a specific commit hash (does not change), `dvc update` without options
will not have an effect on the import stage. You may force-update it to a
will not have an effect on the import `.dvc` file. You may force-update it to a
different commit with `dvc update --rev`:

```dvc
$ dvc update --rev cats-dogs-v2
```

> In the above example, the value for `rev` in the new import stage will be
> In the above example, the value for `rev` in the new `.dvc` file will be
> `master` (a branch) so it will be able to update normally going forward.

## Example: Data registry
@@ -230,7 +226,7 @@ $ dvc import [email protected]:iterative/dataset-registry.git \
`dvc import` provides a better way to incorporate data files tracked in external
<abbr>DVC repositories</abbr> because it saves the connection between the
current project and the source repo. This means that enough information is
recorded in an import stage (`.dvc` file) in order to
recorded in an import `.dvc` file in order to
[reproduce](/doc/command-reference/repro) downloading of this same data version
in the future, where and when needed. This is achieved with the `repo` field,
for example (matching the import command above):
@@ -265,8 +261,8 @@ Importing ...

> Note that Git-tracked files can be imported from DVC repos as well.

The file is imported, and along with it, an import stage (`.dvc` file) is
created. Check `it-standards.csv.dvc`:
The file is imported, and along with it, an import `.dvc` file is created. Check
`it-standards.csv.dvc`:

```yaml
deps:
2 changes: 1 addition & 1 deletion content/docs/command-reference/remote/modify.md
@@ -799,7 +799,7 @@ by HDFS. Read more about it by expanding the WebHDFS section in
> written to a Git-ignored config file.

> Note that `user/password` and `token` authentication are incompatible. You
> should authenticate against yout WebDAV remote by either `user/password` or
> should authenticate against your WebDAV remote by either `user/password` or
> `token`.

- `ask_password` - ask each time for the password to use for `user/password`