This repository has been archived by the owner on Jul 5, 2022. It is now read-only.

Fix Accessing Scenario #35

Merged 10 commits on Mar 9, 2021
39 changes: 31 additions & 8 deletions get-started/accessing/01-download.md
@@ -1,19 +1,42 @@
# Download

Let's first get/download any file that was added to DVC:

> You don't need to be inside a Git or DVC repo to execute it
We can download any file in a DVC repository:

```
dvc get \
https://github.com/iterative/dataset-registry \
get-started/data.xml
```{{execute}}

`ls data.xml`{{execute}}

`md5sum data.xml`{{execute}}

Here we see that instead of accessing the data file directly (e.g. with
`aws s3 cp`, `scp`, `wget`, etc.), we are accessing it using a Git repo URL as an
_entry point_ or as a _data/model registry_.
`dvc get` automates this by reading the storage URL
`https://remote.dvc.org/dataset-registry` from
[.dvc/config](https://github.com/iterative/dataset-registry/blob/master/.dvc/config)
and the path `a3/04afb96060aad90176268345e10355` from
[get-started/data.xml.dvc](https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc).

Just for fun, let's try to download it with `wget`:

```
storage="https://remote.dvc.org/dataset-registry"
path="a3/04afb96060aad90176268345e10355"
wget -O data.xml.1 $storage/$path
```{{execute}}

Check whether they are the same file:

`diff data.xml data.xml.1`{{execute}}
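
The way a checksum maps to a storage path can be sketched in Python. This is only an illustration, assuming the layout visible in the `wget` example above (the first two hex characters of the MD5 form a directory and the remaining characters form the file name) and assuming the recorded checksum is a plain MD5 of the file contents. The helper names `cache_url` and `file_md5` are hypothetical and not part of DVC.

```python
import hashlib


def cache_url(storage: str, md5: str) -> str:
    """Build the remote URL for a file from its MD5 checksum.

    Assumed layout: <storage>/<first 2 hex chars>/<remaining hex chars>.
    """
    return f"{storage.rstrip('/')}/{md5[:2]}/{md5[2:]}"


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 of a file in chunks, so large files are never read at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


url = cache_url(
    "https://remote.dvc.org/dataset-registry",
    "a304afb96060aad90176268345e10355",
)
print(url)
# https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355
```

Under the same assumptions, `file_md5("data.xml")` could be compared against the checksum to confirm that the downloaded file matches what the registry recorded.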

Instead of downloading the data file directly (e.g., with `aws s3 cp`, `scp`, or
`wget`), we are accessing it using a Git repo URL as an _entry point_ or as
a [_data/model registry_][data-registries].

[data-registries]: https://dvc.org/doc/use-cases/data-registries

By the way, we didn't initialize DVC in the current directory yet. `dvc get`
doesn't need an initialized project.

Let's initialize DVC now.

`dvc init`{{execute}}
20 changes: 20 additions & 0 deletions get-started/accessing/02-discovering-files.md
@@ -0,0 +1,20 @@
# Discovering files

As we mentioned, if you look at the [repository][dr], you won't see
`data/data.xml`, `model.pkl`, or any other DVC-tracked files. They are not stored
in Git. We can `dvc get` them, but how do we even know what data is tracked in a
remote DVC repo before accessing it?

[dr]: https://github.com/iterative/dataset-registry

If `dvc get` is the analog of `wget` or `curl`, then `dvc list` is the analog
of `ls` or `aws s3 ls` and similar commands:

```
dvc list \
https://github.com/iterative/example-get-started \
data/
```{{execute}}

The only difference is that we pass a Git URL, using the same interface as
`dvc get`. Now we can see `data.xml` as well as regular Git files.
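
To call `dvc list` from a script, one option is to wrap the CLI. The sketch below is an assumption-laden illustration rather than a DVC API: `dvc_list_command` and `list_tracked` are hypothetical helpers, and the only flags used are `--recursive` and `--dvc-only` from the `dvc list` command line.

```python
import subprocess
from typing import List, Optional


def dvc_list_command(repo_url: str, path: Optional[str] = None,
                     recursive: bool = False, dvc_only: bool = False) -> List[str]:
    """Assemble the argument list for `dvc list`."""
    cmd = ["dvc", "list", repo_url]
    if path:
        cmd.append(path)
    if recursive:
        cmd.append("--recursive")
    if dvc_only:
        cmd.append("--dvc-only")  # show only DVC-tracked entries
    return cmd


def list_tracked(repo_url: str, path: Optional[str] = None, **options) -> List[str]:
    """Run `dvc list` and return one entry per output line.

    Needs the `dvc` executable and network access.
    """
    result = subprocess.run(dvc_list_command(repo_url, path, **options),
                            check=True, capture_output=True, text=True)
    return result.stdout.splitlines()


cmd = dvc_list_command("https://github.com/iterative/example-get-started",
                       "data/", dvc_only=True)
print(" ".join(cmd))
# dvc list https://github.com/iterative/example-get-started data/ --dvc-only
```

Building the argument list separately from running it keeps the command easy to inspect before anything touches the network.
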
29 changes: 0 additions & 29 deletions get-started/accessing/02-how-does-it-work.md

This file was deleted.

21 changes: 0 additions & 21 deletions get-started/accessing/03-discovering-files.md

This file was deleted.

32 changes: 32 additions & 0 deletions get-started/accessing/03-python-api.md
@@ -0,0 +1,32 @@
# Python API

Besides using DVC commands in the command line, we can also access any
DVC-tracked artifact "natively" from Python with
[the API](https://dvc.org/doc/api-reference):

`process.py`{{open}}

The script downloads the data like `dvc get` and counts the number of lines in it:

`python3 process.py`{{execute}}

The interface of [`dvc.api.open`][apiopen] is similar to the one we've
seen already. It takes a Git repo URL and a path as arguments, and works
the same way. There are also a few important differences:

[apiopen]: https://dvc.org/doc/api-reference/open

- **It's Python "native"** - we don't have to combine CLI scripts with Python
code to process data or deploy a model.

- **It doesn't consume space on the file system** - `open()` loads the data
into memory as needed instead of saving a copy to disk. If you want to
process a large dataset or deploy a huge model, you don't have to keep it on
disk.

- **It reads data lazily** - it doesn't allocate a huge array internally to hold
the data; instead, it streams it from the remote storage. This means you can
process a huge dataset with a very low memory footprint.

- **It unifies storage access** - it doesn't matter if actual data is stored on
S3, or Google Cloud, or SSH - the interface is the same.
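
The contents of `process.py` are not shown here, so the following is only a sketch of what a line-counting script built on `dvc.api.open` might look like. The function names `count_lines` and `count_remote_lines` are hypothetical; the only DVC call used is `dvc.api.open(path, repo=...)` as a context manager.

```python
import io


def count_lines(stream) -> int:
    """Count lines by iterating the stream, one line at a time."""
    return sum(1 for _ in stream)


def count_remote_lines(path: str, repo: str) -> int:
    """Stream a DVC-tracked file with dvc.api.open and count its lines."""
    import dvc.api  # imported here so count_lines works without dvc installed

    with dvc.api.open(path, repo=repo) as f:
        return count_lines(f)


# Local check of the counting logic with an in-memory stream:
sample = io.StringIO("first\nsecond\nthird\n")
print(count_lines(sample))  # 3
```

With DVC installed and network access, `count_remote_lines("get-started/data.xml", repo="https://github.com/iterative/dataset-registry")` would apply the same counting to the file used earlier in this scenario.
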
32 changes: 0 additions & 32 deletions get-started/accessing/04-python-api.md

This file was deleted.

36 changes: 36 additions & 0 deletions get-started/accessing/04-reusing-data-or-models.md
@@ -0,0 +1,36 @@
# Reusing (importing) data or models

Modern programming languages have a way to package, distribute, and reuse code
as libraries (in the Python world - [PyPI](https://pypi.org/), `pip`, `conda`
are some well-known examples). It's an important feature that significantly
simplifies managing the complexity of the development process.

What about datasets and ML models?

A DVC repository and the `dvc import` command are enough to export data and models,
reuse them, track upstream changes, etc. Let's give it a try:

```
dvc import \
https://github.com/iterative/dataset-registry \
get-started/data.xml -o data/data.xml
```{{execute}}

The `dvc import` command creates `data/data.xml.dvc` to track the dependency. You
can view this file in the editor:

`data/data.xml.dvc`{{open}}

The `url` and `rev_lock` subfields under `repo` save the origin and
the version of the dependency, respectively.

The effect of using `dvc import` is similar to running `dvc get` + `dvc add`,
but the resulting `.dvc` file includes metadata to track changes in the source
repository. This allows you to bring in changes from the data source later,
using:

`dvc update data/data.xml.dvc`{{execute}}

In this case, everything is up to date, but if someone creates a new version of
`data.xml` in the data registry, this command can detect the change and update the
`data/data.xml` file accordingly.
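
To make the role of `url` and `rev_lock` concrete, here is a small sketch that reads them from the text of a `.dvc` file. The sample contents are an assumed shape, not copied from a real file, and the `rev_lock` value is a placeholder. `import_source` is a hypothetical helper that uses a simple line scan where a real tool would use a YAML parser.

```python
from typing import Dict

# Assumed shape of data/data.xml.dvc; the rev_lock value is a placeholder.
SAMPLE_DVC_FILE = """\
deps:
- path: get-started/data.xml
  repo:
    url: https://github.com/iterative/dataset-registry
    rev_lock: 0000000000000000000000000000000000000000
outs:
- path: data.xml
"""


def import_source(dvc_file_text: str) -> Dict[str, str]:
    """Extract the `url` and `rev_lock` fields with a simple line scan.

    Only handles flat `key: value` lines, which is enough for this sketch.
    """
    found = {}
    for line in dvc_file_text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep and key in ("url", "rev_lock"):
            found[key] = value.strip()
    return found


source = import_source(SAMPLE_DVC_FILE)
print(source["url"])       # where the data came from
print(source["rev_lock"])  # the exact source revision that was imported
```

Comparing `rev_lock` before and after `dvc update` is one way to see whether the import moved to a newer revision of the source repository.
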
37 changes: 0 additions & 37 deletions get-started/accessing/05-reusing-data-or-models.md

This file was deleted.

17 changes: 17 additions & 0 deletions get-started/accessing/06-congrats.md
@@ -0,0 +1,17 @@
# Congratulations!

In this scenario, we learned how to access data registries via DVC. We can
download model and data files with `dvc get` or import them to DVC repositories
with `dvc import`. DVC also has an API that streams large files directly into
the memory with `dvc.api.open`.

Our vision is to have a central registry for all data and model files and
to use them across different projects. It's based on Git and provides
flexibility without requiring additional infrastructure.

<p align="center">
<img src="/dvc/courses/get-started/accessing/assets/data-registry.png">
</p>

If you want to read more about the workflow, refer to the
[Data Registries](https://dvc.org/doc/use-cases/data-registries) use case.
22 changes: 0 additions & 22 deletions get-started/accessing/06-data-model-and-artifact.md

This file was deleted.

7 changes: 4 additions & 3 deletions get-started/accessing/index.json
@@ -27,7 +27,7 @@
},
{
"title": "Step 6",
"text": "06-data-model-and-artifact.md"
"text": "06-congrats.md"
}
],
"intro": {
@@ -47,9 +47,10 @@
}
},
"environment": {
"uilayout": "terminal"
"uieditorpath": "/root/project",
"uilayout": "vscode-terminal-split"
},
"backend": {
"imageid": "ubuntu:2004"
}
}
}
13 changes: 8 additions & 5 deletions get-started/accessing/init.sh
@@ -18,11 +18,14 @@ until hash dvc &>/dev/null; do sleep 1; done
# enable bash completion
source /etc/bash_completion

git clone --branch 3-config-remote \
https://github.com/iterative/example-get-started.git
cd example-get-started/
git reset --hard 3-config-remote
cd ..
# git clone --branch 3-config-remote \
# https://github.com/iterative/example-get-started.git
# cd example-get-started/
# git reset --hard 3-config-remote
# cd ..

git init project
cd project

# clear screen
clear
16 changes: 10 additions & 6 deletions get-started/accessing/install.sh
@@ -1,13 +1,17 @@
#!/bin/bash

apt install --yes highlight virtualenv
apt install --yes highlight

# install dvc
sudo wget https://dvc.org/deb/dvc.list \
-O /etc/apt/sources.list.d/dvc.list
sudo apt-get update -o Dir::Etc::sourcelist="sources.list.d/dvc.list" \
-o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0"
sudo apt install dvc
# sudo wget https://dvc.org/deb/dvc.list \
# -O /etc/apt/sources.list.d/dvc.list
# sudo apt-get update -o Dir::Etc::sourcelist="sources.list.d/dvc.list" \
# -o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0"
# sudo apt install dvc

# installing from pip is faster

pip3 install dvc

# install bash completion for dvc
dvc completion -s bash > /etc/bash_completion.d/dvc