Merge pull request #7 from iterative/data-registry
Add dataset registry repo/project generator
jorgeorpinel authored Aug 26, 2019
2 parents c37bf05 + b1bd569 commit d5d03f2
Showing 6 changed files with 233 additions and 41 deletions.
6 changes: 5 additions & 1 deletion README.md
@@ -23,7 +23,11 @@ $ cd example-get-started
$ ./deploy.sh
```

<!-- ### dataset-registry -->
### dataset-registry

- `generate.sh`: Generates the `dataset-registry` DVC project from scratch. This
project is used by **example-get-started** below, so it should be generated
first.
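
Because of this ordering requirement, the generators could be driven by one small helper. The following is a minimal sketch and not part of this commit: `run_in_order` is a hypothetical name, and it assumes `example-get-started/deploy.sh` is the script that builds the second project.

```shell
# Sketch: run project generators one after another, stopping at the first failure.
run_in_order() {
    for script in "$@"; do
        echo "==> $script"
        # Run each script from its own directory, in a subshell so the
        # caller's working directory is left unchanged.
        ( cd "$(dirname "$script")" && "./$(basename "$script")" ) || {
            echo "Failed: $script" >&2
            return 1
        }
    done
}

# Usage from the repository root (the second script name is an assumption):
#   run_in_order dataset-registry/generate.sh example-get-started/deploy.sh
```

Stopping at the first failure keeps a broken registry build from being consumed by the project that depends on it.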

### example-get-started

73 changes: 73 additions & 0 deletions dataset-registry/code/README.md
@@ -0,0 +1,73 @@
# DVC Dataset Registry

This is an auto-generated repository for use in https://dvc.org/doc/. Please
report any issues in its source project,
[example-repos-dev](https://github.com/iterative/example-repos-dev).

_Dataset Registry_ is a centralized place to manage raw data files for use in
other example DVC projects, such as
https://github.com/iterative/example-get-started.

## Installation

Start by cloning the project:

```console
$ git clone https://github.com/iterative/dataset-registry
$ cd dataset-registry
```

This DVC project comes with a preconfigured DVC
[remote storage](https://man.dvc.org/remote) to hold all of the datasets. This
is a read-only HTTP remote.

```console
$ dvc remote list
storage https://remote.dvc.org/dataset-registry
```

You can run [`dvc pull`](https://man.dvc.org/pull) to download specific datasets
locally:

```console
$ dvc pull -r storage get-started/data.xml
```

## Testing data synchronization locally

If you'd like to test commands like [`dvc push`](https://man.dvc.org/push) that
require write access to the remote storage, the easiest way is to set up a
"local remote" on your file system:

> This kind of remote is located in the local file system, but is external to
> the DVC project.

```console
$ mkdir -p /tmp/dvc-storage
$ dvc remote add local /tmp/dvc-storage
```

You should now be able to run:

```console
$ dvc push -r local
```
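
The setup above can be made repeatable with a small helper. This is a minimal sketch, not part of the project: the function name `setup_local_remote` is hypothetical, and it assumes `dvc remote list` prints one remote per line with the name in the first column, as in the output shown earlier.

```shell
# Sketch: create a "local remote" only if it is not configured yet.
# The function name, remote name and path are illustrative assumptions.
setup_local_remote() {
    name="$1"
    path="$2"

    # -p (lowercase) creates missing parents and succeeds if the directory exists.
    mkdir -p "$path" || return 1

    # Compare the first column of `dvc remote list` against the wanted name.
    if dvc remote list | awk '{ print $1 }' | grep -qxF "$name"; then
        echo "Remote '$name' is already configured, leaving it untouched."
    else
        dvc remote add "$name" "$path"
    fi
}

# Usage (run inside the cloned project):
#   setup_local_remote local /tmp/dvc-storage
#   dvc push -r local
```

Checking the list first avoids relying on how `dvc remote add` reacts to a name that already exists.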

## Datasets

The folder structure of this project groups datasets by the external projects
they belong to.

After cloning and using [`dvc pull`](https://man.dvc.org/pull) to download data
under DVC control, the workspace should look like this:

```console
$ tree
.
├── README.md
└── get-started
    ├── data.xml      # Dataset used in iterative/example-get-started
    └── data.xml.dvc

1 directory, 3 files
```
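
One way to confirm that the workspace matches this layout after pulling is to check that every DVC-file has its data file next to it. This is a rough sketch under one assumption: each `<name>.dvc` file tracks a single file called `<name>`, as `data.xml.dvc` does here. The function name `missing_data` is hypothetical.

```shell
# Sketch: list data files that a DVC-file refers to but that are absent
# from the workspace. Assumes "<name>.dvc" tracks a file called "<name>".
missing_data() {
    root="${1:-.}"
    # Skip DVC's internal .dvc/ directory; only look at DVC-files in the workspace.
    find "$root" -type f -name '*.dvc' ! -path '*/.dvc/*' | sort |
    while IFS= read -r dvcfile; do
        data="${dvcfile%.dvc}"          # strip the .dvc suffix
        [ -e "$data" ] || echo "$data"  # report only missing data files
    done
}

# Usage (run inside the cloned project):
#   missing_data .
```

An empty result suggests that all tracked files were downloaded.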
80 changes: 80 additions & 0 deletions dataset-registry/generate.sh
@@ -0,0 +1,80 @@
#!/bin/bash

# Setup script env

# e Exit immediately if a command exits with a non-zero exit status.
# u Treat unset variables as an error when substituting.
# x Print commands and their arguments as they are executed.
set -eux

HERE="$( cd "$(dirname "$0")" ; pwd -P )"
REPO_NAME="dataset-registry"
REPO_PATH="$HERE/build/$REPO_NAME"

if [ -d "$REPO_PATH" ]; then
  echo "Repo $REPO_PATH already exists, remove it first."
  exit 1
fi

mkdir -p "$REPO_PATH"
pushd "$REPO_PATH"

# Create virtualenv, install `dvc`, initialize/config DVC project

virtualenv -p python3 .env
export VIRTUAL_ENV_DISABLE_PROMPT=true
source .env/bin/activate
echo '.env/' >> .gitignore

pip install "dvc[s3]"

git init
dvc init

# Remote active on this environment only for writing to HTTP redirect below.
dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry

# Actual remote for generated project (read-only). Redirect of S3 bucket below.
dvc remote add -d storage https://remote.dvc.org/dataset-registry

cp "$HERE/code/README.md" "$REPO_PATH"

git add .
git commit -m "Init & config DVC project, add README"

# Get Started

mkdir get-started
wget https://data.dvc.org/get-started/data.xml -O get-started/data.xml
dvc add get-started/data.xml
git add get-started/.gitignore get-started/data.xml.dvc
git commit -m "Add Get Started dataset"
dvc push

# TODO: Gather more datasets!

popd

cat <<EOF
The Git repo generated by this script is intended to be published on
https://github.com/iterative/dataset-registry. Make sure the GitHub repo
exists first.
To create it with https://hub.github.com/ for example, run:
hub create iterative/dataset-registry -d "Get Started DVC project" \
-h "https://dvc.org/doc/get-started"
If the GitHub repo already exists, run these commands to rewrite it:
cd build/dataset-registry
git remote add origin git@github.com:iterative/dataset-registry.git
git push --force origin master
cd ../..
You may remove the generated repo with:
rm -fR build
EOF
1 change: 1 addition & 0 deletions example-get-started/.gitignore
@@ -1,3 +1,4 @@
# Custom
*.zip
/tmp
build/
72 changes: 40 additions & 32 deletions example-get-started/code/README.md
@@ -1,14 +1,18 @@
# DVC Get Started

This is an auto-generated repository for use in https://dvc.org/doc/get-started.
Please report any issues in
Please report any issues in its source project,
[example-repos-dev](https://github.com/iterative/example-repos-dev).

![](https://dvc.org/static/img/example-flow-2x.png)

_Get Started_ is a step-by-step introduction to basic DVC concepts. It doesn't
go into much detail, but provides links and expandable sections to learn more.

> Note that this project
[imports](https://dvc.org/doc/commands-reference/import) a dataset from
https://github.com/iterative/dataset-registry.

The project is a simplified version of the
[Tutorial](https://dvc.org/doc/tutorial). It explores the natural language
processing (NLP) problem of predicting tags for a given StackOverflow question.
@@ -34,8 +38,10 @@ $ source .env/bin/activate
$ pip install -r src/requirements.txt
```

This DVC project comes with a preconfigured remote DVC storage that has raw data
(input), intermediate, and final results that are produced.
This DVC project comes with a preconfigured DVC
[remote storage](https://dvc.org/doc/commands-reference/remote) that holds raw
data (input), intermediate, and final results that are produced. This is a
read-only HTTP remote.

```console
$ dvc remote list
@@ -48,7 +54,7 @@ You can run [`dvc pull`](https://man.dvc.org/pull) to download the data:
$ dvc pull -r storage
```

## Running in Your Environment
## Running in your environment

Run [`dvc repro`](https://man.dvc.org/repro) to reproduce the
[pipeline](https://dvc.org/doc/commands-reference/pipeline):
@@ -80,52 +86,54 @@ You should now be able to run:
$ dvc push -r local
```

## Existing Stages
## Existing stages

The Git tags in this project reflect the sequence of actions run in the DVC
[get started](https://dvc.org/doc/get-started) guide. Feel free to check out
any of them and try DVC commands in a ready-made playground.

- `0-empty` - empty Git repository.
- `1-initialize` - DVC has been initialized. The `.dvc` with the cache directory
- `0-empty`: Empty Git repository initialized.
- `1-initialize`: DVC has been initialized. `.dvc/` with the cache directory
created.
- `2-remote` - remote HTTP storage initialized. It is a shared read only storage
- `2-remote`: Remote HTTP storage initialized. It's a shared read-only storage
  that contains all data artifacts produced during the next steps.
- `3-add-file` - raw data file `data.xml` downloaded and put under DVC
control with [`dvc add`](https://man.dvc.org/add). First `.dvc` meta-file
created.
- `4-source` - source code downloaded and put under Git control.
- `5-preparation` - first DVC stage created using
- `3-add-file`: Raw data file `data.xml` downloaded and put under DVC control
with [`dvc add`](https://man.dvc.org/add). First DVC-file (`.dvc` file
extension) created.
- `4-source`: Source code downloaded and put under Git control.
- `5-preparation`: First stage file (DVC-file) created using
[`dvc run`](https://man.dvc.org/run). It transforms XML data into TSV.
- `6-featurization` - feature extraction step added. It also includes the split
step for simplicity. It takes data in TSV format and produces two `.pkl` files
that contain serialized feature matrices.
- `7-train` - the model training stage added. It produces `model.pkl` file - the
actual result that can be then deployed somewhere and classify questions.
- `8-evaluate` - evaluate stage, we run it on a test dataset to see the AUC
value for the model. The result is dumped into a DVC metric file so that we
can compare it with other experiments later.
- `9-bigrams` - bigrams experiment, code has been modified to extract more
- `6-featurization`: Feature extraction stage created. It takes data in TSV
format and produces two `.pkl` files that contain serialized feature matrices.
- `7-train`: Model training stage created. It produces `model.pkl` file – the
actual result that can then get deployed to an app that implements NLP
classification.
- `8-evaluate`: Evaluation stage. Runs the model on a test dataset to produce
its performance AUC value. The result is dumped into a DVC metric file so that
we can compare it with other experiments later.
- `9-bigrams-model`: Bigrams experiment, code has been modified to extract more
features. We run [`dvc repro`](https://man.dvc.org/repro) for the first time
to illustrate how DVC can reuse cached files and detect changes along the
computational graph.
computational graph, regenerating the model with the updated data.
- `10-bigrams-experiment`: Reproduce the evaluation stage with the bigrams-based
  model.

There are two additional tags:

- `baseline-experiment` - the first end-to-end result that we performance metric
- `baseline-experiment`: First end-to-end result that we have a performance
  metric for.
- `bigrams-experiment` - second version of the experiment.
- `bigrams-experiment`: Second experiment (model trained using bigrams
features).

Both these tags could be used to illustrate `-a` or `-T` options across
different [DVC commands](https://man.dvc.org/).
These tags can be used to illustrate `-a` or `-T` options across different
[DVC commands](https://man.dvc.org/).
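
To walk through the numbered stage tags in sequence, they need to be ordered by their leading number, since a plain lexical sort would place `10-bigrams-experiment` before `2-remote`. A minimal sketch, where the helper name `order_stage_tags` is hypothetical:

```shell
# Sketch: order stage tags such as "0-empty" ... "10-bigrams-experiment".
order_stage_tags() {
    # Keep only tags that start with a digit, then sort by that leading
    # number so that 10 comes after 9.
    grep '^[0-9]' | sort -n
}

# Usage (run inside the cloned project):
#   git tag --list | order_stage_tags | while IFS= read -r tag; do
#       git checkout --quiet "$tag"
#       echo "At $tag"
#   done
```

The named tags (`baseline-experiment`, `bigrams-experiment`) are filtered out because they do not start with a digit.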

## Project Structure
## Project structure

The data files, DVC-files, and results change as stages are created one by one,
but right after you for Git clone and [`dvc pull`](https://man.dvc.org/pull) to
download files that are under DVC control, the structure of the project should
look like this:
The data files, DVC-files, and results change as stages are created one by one.
After cloning and using [`dvc pull`](https://man.dvc.org/pull) to download data
under DVC control, the workspace should look like this:

```console
$ tree