This repository has been archived by the owner on Jul 5, 2022. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #35 from iterative/iesahin/issue28
Fixes for Data Access Scenario
Showing 13 changed files with 158 additions and 163 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,19 +1,42 @@ | ||
# Download | ||
|
||
Let's first get/download any file that was added to DVC: | ||
|
||
> You don't need to be inside a Git or DVC repo to execute it | ||
We can download any file in a DVC repository: | ||
|
||
``` | ||
dvc get \ | ||
https://github.com/iterative/dataset-registry \ | ||
get-started/data.xml | ||
```{{execute}} | ||
`ls data.xml`{{execute}} | ||
`md5sum data.xml`{{execute}} | ||
Here we see that instead of accessing the data file directly (e.g. with `aws s3 cp`, | ||
`scp`, `wget`, etc.), we are accessing it using a Git repo URL as an _entry | ||
point_ or as a _data/model registry_. | ||
`dvc get` automated this by reading `https://remote.dvc.org/dataset-registry` | ||
from | ||
[.dvc/config](https://github.com/iterative/dataset-registry/blob/master/.dvc/config) | ||
and `a3/04afb96060aad90176268345e10355` path from | ||
[get-started/data.xml.dvc](https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc). | ||
Just for fun, let's try to download it with `wget`: | ||
``` | ||
storage="https://remote.dvc.org/dataset-registry" | ||
path="a3/04afb96060aad90176268345e10355" | ||
wget -O data.xml.1 $storage/$path | ||
```{{execute}} | ||
Check whether they are the same file: | ||
`diff data.xml data.xml.1`{{execute}} | ||
Instead of downloading the data file directly (e.g., with `aws s3 cp`, `scp`, or | ||
`wget`), we are accessing it using a Git repo URL as an _entry point_ or as | ||
a [_data/model registry_][data-registries]. | ||
[data-registries]: https://dvc.org/doc/use-cases/data-registries | ||
By the way, we didn't initialize DVC in the current directory yet. `dvc get` | ||
doesn't need an initialized project. | ||
Let's initialize DVC now. | ||
`dvc init`{{execute}} |
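The `wget` step in the removed text shows how `dvc get` locates a file: a remote base URL plus a path built from the file's hash. The following is a minimal Python sketch of that mapping, not DVC's own code. It assumes the hash recorded in `get-started/data.xml.dvc` is the 32-character MD5 obtained by joining the two parts of the path quoted above, and that the remote uses the two-character-prefix layout visible in that path. `cache_url` is a hypothetical helper name.

```python
def cache_url(remote_base: str, md5: str) -> str:
    """Build the storage URL for a file from its MD5 hash.

    Assumption: the remote stores each file under <first 2 hex chars>/<rest>,
    as suggested by the path `a3/04afb9...` quoted in the tutorial.
    """
    md5 = md5.strip().lower()
    if len(md5) != 32 or any(c not in "0123456789abcdef" for c in md5):
        raise ValueError(f"not a 32-character hex MD5: {md5!r}")
    return f"{remote_base.rstrip('/')}/{md5[:2]}/{md5[2:]}"


# Values taken from the tutorial text; the joined hash is an inference.
url = cache_url(
    "https://remote.dvc.org/dataset-registry",
    "a304afb96060aad90176268345e10355",
)
print(url)
# https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355
```

The printed URL is the same one the `wget` command assembles from `$storage/$path`.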
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
# Discovering files | ||
|
||
As we mentioned, if you look at the [repository][dr], you won't see | ||
`data/data.xml` or `model.pkl`, or any DVC-tracked files. They are not stored | ||
in Git. We can `dvc get` them, but how do we even know what data is tracked in a | ||
remote DVC repo before accessing it? | ||
|
||
[dr]: https://github.com/iterative/dataset-registry | ||
|
||
If `dvc get` is the analog of `wget` or `curl`, then `dvc list` is the analog | ||
of `ls` or `aws s3 ls` and similar commands: | ||
|
||
``` | ||
dvc list \ | ||
https://github.com/iterative/example-get-started \ | ||
data/ | ||
```{{execute}} | ||
The only difference is that we pass a Git URL; the interface is the same as for `dvc get`. Now | ||
we can see `data.xml` as well as regular Git files. |
This file was deleted.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
# Python API | ||
|
||
Besides using DVC commands in the command line, we can also access any | ||
DVC-tracked artifact "natively" from Python with | ||
[the API](https://dvc.org/doc/api-reference): | ||
|
||
`process.py`{{open}} | ||
|
||
The script downloads the data like `dvc get` and counts the number of lines in it: | ||
|
||
`python3 process.py`{{execute}} | ||
|
||
The interface of [`dvc.api.open`][apiopen] is similar to the one we've | ||
seen already. It takes a Git repo URL and a path as arguments, and works | ||
the same way. There are also a few important differences: | ||
|
||
[apiopen]: https://dvc.org/doc/api-reference/open | ||
|
||
- **It's Python "native"** - we don't have to combine CLI scripts with Python | ||
code to process data or deploy a model. | ||
|
||
- **It doesn't consume disk space** - `open()` doesn't write a copy of the file | ||
to the file system; it loads the data into memory as needed. If you want to | ||
process a large dataset or deploy a huge model, you don't have to keep it on | ||
disk. | ||
|
||
- **It reads data lazily** - it doesn't allocate a huge array internally to hold | ||
the data; instead, it streams it from the remote storage. This means you can | ||
process a huge dataset with a very low memory footprint. | ||
|
||
- **It unifies storage access** - it doesn't matter if actual data is stored on | ||
S3, or Google Cloud, or SSH - the interface is the same. |
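The diff does not show `process.py` itself, so here is a rough sketch of what such a script might look like, under stated assumptions. It uses `dvc.api.open` with the `path` and `repo` arguments described above; the helper names `count_lines` and `count_remote_lines`, and the choice of `get-started/data.xml` from the dataset registry as the target, are illustrative guesses rather than the tutorial's actual code.

```python
import io


def count_lines(stream) -> int:
    """Count lines by iterating over the stream one line at a time,
    so the whole file is never held in memory at once."""
    return sum(1 for _ in stream)


def count_remote_lines(path: str, repo: str) -> int:
    """Open a DVC-tracked file from a Git repo URL and count its lines.

    The import is done here so that count_lines() can be used (and tested)
    on machines where DVC is not installed.
    """
    import dvc.api

    with dvc.api.open(path, repo=repo) as f:
        return count_lines(f)


# Local self-check on an in-memory stream; no network or DVC needed.
sample = io.StringIO("<row/>\n<row/>\n")
print(count_lines(sample))  # 2
```

With DVC installed and network access available, calling `count_remote_lines("get-started/data.xml", "https://github.com/iterative/dataset-registry")` would stream the file and return its line count.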
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
# Reusing (importing) data or models | ||
|
||
Modern programming languages have a way to package, distribute, and reuse code | ||
as libraries (in the Python world - [PyPI](https://pypi.org/), `pip`, `conda` | ||
are some well-known examples). It's an important feature that significantly | ||
simplifies managing the complexity of the development process. | ||
|
||
What about datasets and ML models? | ||
|
||
A DVC repository and the `dvc import` command are enough to export data and models, | ||
reuse them, track upstream changes, etc. Let's give it a try: | ||
|
||
``` | ||
dvc import \ | ||
https://github.com/iterative/dataset-registry \ | ||
get-started/data.xml -o data/data.xml | ||
```{{execute}} | ||
The `dvc import` command creates `data/data.xml.dvc` to track the dependency. You | ||
can view this file in the editor: | ||
`data/data.xml.dvc`{{open}} | ||
The `url` and `rev_lock` subfields under `repo` are used to save the origin and | ||
the version of the dependency, respectively. | ||
The effect of using `dvc import` is similar to running `dvc get` + `dvc add`, | ||
but the resulting `.dvc` file includes metadata to track changes in the source | ||
repository. This allows you to bring in changes from the data source later, | ||
using: | ||
`dvc update data/data.xml.dvc`{{execute}} | ||
In this case, everything is up to date, but if someone creates a new version of | ||
`data.xml` in the data registry, this command can detect the change and update the | ||
`data/data.xml` file accordingly. |
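To make the role of `url` and `rev_lock` concrete, the sketch below models a parsed `data/data.xml.dvc` file as a Python dict and reads those two fields. Only the names `repo`, `url`, and `rev_lock` come from the text above; the surrounding `deps`/`outs` layout, the placeholder values, and the helpers `import_origin` and `needs_update` are assumptions made for illustration, and the comparison is a conceptual stand-in for what `dvc update` checks, not DVC's implementation.

```python
# Hypothetical parsed form of data/data.xml.dvc (YAML loaded into a dict).
# The rev_lock value is a placeholder, not a real commit hash.
parsed = {
    "deps": [
        {
            "path": "get-started/data.xml",
            "repo": {
                "url": "https://github.com/iterative/dataset-registry",
                "rev_lock": "<commit pinned at import time>",
            },
        }
    ],
    "outs": [{"path": "data.xml"}],
}


def import_origin(dvc_file: dict) -> tuple:
    """Return (url, rev_lock) of the first imported dependency."""
    repo = dvc_file["deps"][0]["repo"]
    return repo["url"], repo["rev_lock"]


def needs_update(dvc_file: dict, upstream_rev: str) -> bool:
    """Conceptual check: the import is stale if upstream moved past rev_lock."""
    _, locked = import_origin(dvc_file)
    return locked != upstream_rev


origin_url, locked_rev = import_origin(parsed)
print(origin_url)
print(needs_update(parsed, locked_rev))  # False: nothing changed upstream
```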
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
# Congratulations! | ||
|
||
In this scenario, we learned how to access data registries via DVC. We can | ||
download model and data files with `dvc get` or import them to DVC repositories | ||
with `dvc import`. DVC also has an API that streams large files directly into | ||
memory with `dvc.api.open`. | ||
|
||
Our vision is to have a central registry for all the data and model files and | ||
to use them in different projects. It's based on Git and provides flexibility | ||
without requiring additional infrastructure. | ||
|
||
<p align="center"> | ||
<img src="/dvc/courses/get-started/accessing/assets/data-registry.png"> | ||
</p> | ||
|
||
If you want to read more about the workflow, refer to the | ||
[Data Registries](https://dvc.org/doc/use-cases/data-registries) use case. |
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,13 +1,17 @@ | ||
#!/bin/bash | ||
|
||
apt install --yes highlight virtualenv | ||
apt install --yes highlight | ||
|
||
# install dvc | ||
sudo wget https://dvc.org/deb/dvc.list \ | ||
-O /etc/apt/sources.list.d/dvc.list | ||
sudo apt-get update -o Dir::Etc::sourcelist="sources.list.d/dvc.list" \ | ||
-o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0" | ||
sudo apt install dvc | ||
# sudo wget https://dvc.org/deb/dvc.list \ | ||
# -O /etc/apt/sources.list.d/dvc.list | ||
# sudo apt-get update -o Dir::Etc::sourcelist="sources.list.d/dvc.list" \ | ||
# -o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0" | ||
# sudo apt install dvc | ||
|
||
# installing from pip is faster | ||
|
||
pip3 install dvc | ||
|
||
# install bash completion for dvc | ||
dvc completion -s bash > /etc/bash_completion.d/dvc |