This repository has been archived by the owner on Jul 5, 2022. It is now read-only.

Commit

Merge pull request #35 from iterative/iesahin/issue28
Fixes for Data Access Scenario
iesahin authored Mar 9, 2021
2 parents ef5b505 + e7fac80 commit 9209c3c
Showing 13 changed files with 158 additions and 163 deletions.
39 changes: 31 additions & 8 deletions get-started/accessing/01-download.md
@@ -1,19 +1,42 @@
# Download

Let's first get/download any file that was added to DVC:

> You don't need to be inside a Git or DVC repo to execute it
We can download any file in a DVC repository:

```
dvc get \
https://github.com/iterative/dataset-registry \
get-started/data.xml
```{{execute}}
`ls data.xml`{{execute}}
`md5sum data.xml`{{execute}}
Here we see that instead of accessing data file directly (e.g. with `aws s3 cp`,
or `scp`, `wget`, etc) we are accessing it using a Git repo URL as an _entry
point_ or as a _data/model registry_.
`dvc get` automated this by reading `https://remote.dvc.org/dataset-registry`
from
[.dvc/config](https://github.com/iterative/dataset-registry/blob/master/.dvc/config)
and `a3/04afb96060aad90176268345e10355` path from
[get-started/data.xml.dvc](https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc).
Just for fun, let's try to download it with `wget`:
```
storage="https://remote.dvc.org/dataset-registry"
path="a3/04afb96060aad90176268345e10355"
wget -O data.xml.1 $storage/$path
```{{execute}}
Check whether they are the same file:
`diff data.xml data.xml.1`{{execute}}
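The lookup that `dvc get` performs can be sketched in a few lines of Python. This is an illustration only, not DVC's implementation: the helper name `cache_url` is hypothetical, and the layout (first two hex characters of the checksum as a directory) is inferred from the `a3/04afb96060aad90176268345e10355` path shown above.

```python
def cache_url(storage: str, md5: str) -> str:
    """Build the remote URL of a file from its MD5 checksum.

    Assumption: the remote stores each file under <first 2 hex chars>/<rest>,
    as the path a3/04afb96060aad90176268345e10355 suggests.
    """
    return f"{storage.rstrip('/')}/{md5[:2]}/{md5[2:]}"


# Values taken from the tutorial text; the checksum is the path without the slash.
storage = "https://remote.dvc.org/dataset-registry"
md5 = "a304afb96060aad90176268345e10355"
url = cache_url(storage, md5)
print(url)
```

The printed URL is the same one the `wget` command above downloads from.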
Instead of downloading the data file directly, e.g., with `aws s3 cp`, `scp`,
or `wget`, we access it using a Git repo URL as an _entry point_, or as
a [_data/model registry_][data-registries].
[data-registries]: https://dvc.org/doc/use-cases/data-registries
By the way, we haven't initialized DVC in the current directory yet. `dvc get`
doesn't need an initialized project.
Let's initialize DVC now.
`dvc init`{{execute}}
20 changes: 20 additions & 0 deletions get-started/accessing/02-discovering-files.md
@@ -0,0 +1,20 @@
# Discovering files

As we mentioned, if you look at the [repository][dr], you won't see
`data/data.xml`, `model.pkl`, or any other DVC-tracked files. They are not stored
in Git. We can `dvc get` them, but how do we even know what data is tracked in a
remote DVC repo before accessing it?

[dr]: https://github.com/iterative/dataset-registry

If `dvc get` is the analog of `wget` or `curl`, then `dvc list` is the analog
of `ls` or `aws s3 ls` and similar commands:

```
dvc list \
https://github.com/iterative/example-get-started \
data/
```{{execute}}
The only difference is that we pass a Git URL, using the same interface as
`dvc get`. Now we can see `data.xml` alongside regular Git files.
29 changes: 0 additions & 29 deletions get-started/accessing/02-how-does-it-work.md

This file was deleted.

21 changes: 0 additions & 21 deletions get-started/accessing/03-discovering-files.md

This file was deleted.

32 changes: 32 additions & 0 deletions get-started/accessing/03-python-api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# Python API

Besides using DVC commands in the command line, we can also access any
DVC-tracked artifact "natively" from Python with
[the API](https://dvc.org/doc/api-reference):

`process.py`{{open}}

The script downloads the data like `dvc get` and counts the number of lines in it:

`python3 process.py`{{execute}}

The interface of [`dvc.api.open`][apiopen] is similar to the one we've
seen already. It receives a Git repo URL and a path as arguments, and works
the same way. There are also a few important differences:

[apiopen]: https://dvc.org/doc/api-reference/open

- **It's Python "native"** - we don't have to combine CLI scripts with Python
code to process data or deploy a model.

- **It doesn't consume space on the file system** - `open()` loads the data
into memory as needed. If you want to process a large dataset or deploy a huge
model, you don't have to keep it on disk.

- **It reads data lazily** - it doesn't allocate a huge array internally to hold
the data; instead, it streams the data from remote storage. This means you can
process a huge dataset with a very low memory footprint.

- **It unifies storage access** - it doesn't matter whether the actual data is
stored on S3, Google Cloud, or an SSH server - the interface is the same.
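
`process.py` itself is not part of this diff. The following is a minimal sketch of what such a script might look like, not the tutorial's actual file. The helper names `count_lines` and `count_remote_lines` are hypothetical, and the repository URL and path are assumed to be the ones used with `dvc get` earlier.

```python
from typing import Iterable


def count_lines(lines: Iterable[str]) -> int:
    """Count lines one at a time, without holding the whole file in memory."""
    return sum(1 for _ in lines)


def count_remote_lines(
    path: str = "get-started/data.xml",
    repo: str = "https://github.com/iterative/dataset-registry",
) -> int:
    """Stream a DVC-tracked file from its remote and count its lines."""
    import dvc.api  # imported here so count_lines works without DVC installed

    # dvc.api.open is a context manager that yields a file-like object.
    with dvc.api.open(path, repo=repo) as f:
        return count_lines(f)
```

Calling `print(count_remote_lines())` requires network access and an installed `dvc` package. `count_lines` works on any iterable of lines, which keeps the counting logic independent of where the data comes from.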
32 changes: 0 additions & 32 deletions get-started/accessing/04-python-api.md

This file was deleted.

36 changes: 36 additions & 0 deletions get-started/accessing/04-reusing-data-or-models.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
# Reusing (importing) data or models

Modern programming languages have a way to package, distribute, and reuse code
as libraries (in the Python world - [PyPI](https://pypi.org/), `pip`, `conda`
are some well-known examples). It's an important feature that significantly
simplifies managing the complexity of the development process.

What about datasets and ML models?

A DVC repository and the `dvc import` command are enough to export data and models,
reuse them, track upstream changes, etc. Let's give it a try:

```
dvc import \
https://github.com/iterative/dataset-registry \
get-started/data.xml -o data/data.xml
```{{execute}}
The `dvc import` command creates `data/data.xml.dvc` to track the dependency. You
can view this file in the editor:
`data/data.xml.dvc`{{open}}
The `url` and `rev_lock` subfields under `repo` are used to save the origin and
the version of the dependency, respectively.
The effect of using `dvc import` is similar to running `dvc get` + `dvc add`,
but the resulting `.dvc` file includes metadata to track changes in the source
repository. This allows you to bring in changes from the data source later,
using:
`dvc update data/data.xml.dvc`{{execute}}
In this case, everything is up to date, but if someone creates a new version of
`data.xml` in the data registry, this command can detect the change and update the
`data/data.xml` file accordingly.
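
To make the `url` and `rev_lock` fields concrete, here is a small Python sketch that pulls them out of an already parsed `.dvc` file. The dictionary layout (a `deps` list whose entry holds a `repo` mapping) is an assumption about what `dvc import` writes, the helper `import_origin` is hypothetical, and the `rev_lock` value is a placeholder rather than a real commit.

```python
from typing import Optional, Tuple


def import_origin(dvc_file: dict) -> Optional[Tuple[str, str]]:
    """Return (url, rev_lock) of the first imported dependency, if any."""
    for dep in dvc_file.get("deps", []):
        repo = dep.get("repo")
        if repo and "url" in repo and "rev_lock" in repo:
            return repo["url"], repo["rev_lock"]
    return None


# Hand-written stand-in for a parsed data/data.xml.dvc (assumed layout).
example = {
    "deps": [
        {
            "path": "get-started/data.xml",
            "repo": {
                "url": "https://github.com/iterative/dataset-registry",
                "rev_lock": "<commit-sha-placeholder>",
            },
        }
    ],
    "outs": [{"path": "data.xml"}],
}

origin = import_origin(example)
print(origin)
```

Parsing the YAML file into such a dictionary is left out to keep the sketch free of third-party dependencies.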
37 changes: 0 additions & 37 deletions get-started/accessing/05-reusing-data-or-models.md

This file was deleted.

17 changes: 17 additions & 0 deletions get-started/accessing/06-congrats.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Congratulations!

In this scenario, we learned how to access data registries via DVC. We can
download model and data files with `dvc get` or import them into DVC repositories
with `dvc import`. DVC also has an API that streams large files directly into
memory with `dvc.api.open`.

Our vision is to have a central registry for all the data and model files and
to use them in different projects. It's based on Git, and provides flexibility
without requiring additional infrastructure.

<p align="center">
<img src="/dvc/courses/get-started/accessing/assets/data-registry.png">
</p>

If you want to read more about the workflow, refer to the
[Data Registries](https://dvc.org/doc/use-cases/data-registries) use case.
22 changes: 0 additions & 22 deletions get-started/accessing/06-data-model-and-artifact.md

This file was deleted.

7 changes: 4 additions & 3 deletions get-started/accessing/index.json
@@ -27,7 +27,7 @@
},
{
"title": "Step 6",
"text": "06-data-model-and-artifact.md"
"text": "06-congrats.md"
}
],
"intro": {
@@ -47,9 +47,10 @@
}
},
"environment": {
"uilayout": "terminal"
"uieditorpath": "/root/project",
"uilayout": "vscode-terminal-split"
},
"backend": {
"imageid": "ubuntu:2004"
}
}
}
13 changes: 8 additions & 5 deletions get-started/accessing/init.sh
@@ -18,11 +18,14 @@ until hash dvc &>/dev/null; do sleep 1; done
# enable bash completion
source /etc/bash_completion

git clone --branch 3-config-remote \
https://github.com/iterative/example-get-started.git
cd example-get-started/
git reset --hard 3-config-remote
cd ..
# git clone --branch 3-config-remote \
# https://github.com/iterative/example-get-started.git
# cd example-get-started/
# git reset --hard 3-config-remote
# cd ..

git init project
cd project

# clear screen
clear
16 changes: 10 additions & 6 deletions get-started/accessing/install.sh
@@ -1,13 +1,17 @@
#!/bin/bash

apt install --yes highlight virtualenv
apt install --yes highlight

# install dvc
sudo wget https://dvc.org/deb/dvc.list \
-O /etc/apt/sources.list.d/dvc.list
sudo apt-get update -o Dir::Etc::sourcelist="sources.list.d/dvc.list" \
-o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0"
sudo apt install dvc
# sudo wget https://dvc.org/deb/dvc.list \
# -O /etc/apt/sources.list.d/dvc.list
# sudo apt-get update -o Dir::Etc::sourcelist="sources.list.d/dvc.list" \
# -o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0"
# sudo apt install dvc

# installing from pip is faster

pip3 install dvc

# install bash completion for dvc
dvc completion -s bash > /etc/bash_completion.d/dvc
