Add dataset registry repo/project generator #7

Merged
merged 4 commits into from
Aug 26, 2019
6 changes: 5 additions & 1 deletion README.md
@@ -23,7 +23,11 @@ $ cd example-get-started
$ ./deploy.sh
```

<!-- ### dataset-registry -->
### dataset-registry

- `generate.sh`: Generates the `dataset-registry` DVC project from scratch. This
project is used by **example-get-started** below, so it should be generated
first.

### example-get-started

73 changes: 73 additions & 0 deletions dataset-registry/code/README.md
@@ -0,0 +1,73 @@
# DVC Dataset Registry

This is an auto-generated repository for use in https://dvc.org/doc/. Please
report any issues in its source project,
[example-repos-dev](https://github.com/iterative/example-repos-dev).

_Dataset Registry_ is a centralized place to manage raw data files for use in
other example DVC projects, such as
https://github.com/iterative/example-get-started.

## Installation

Start by cloning the project:

```console
$ git clone https://github.com/iterative/dataset-registry
$ cd dataset-registry
```

This DVC project comes with a preconfigured DVC
[remote storage](https://man.dvc.org/remote) to hold all of the datasets. This
is a read-only HTTP remote.

```console
$ dvc remote list
storage https://remote.dvc.org/dataset-registry
```

You can run [`dvc pull`](https://man.dvc.org/pull) to download specific datasets
locally:

```console
$ dvc pull -r storage get-started/data.xml
```

## Testing data synchronization locally

If you'd like to test commands like [`dvc push`](https://man.dvc.org/push) that
require write access to the remote storage, the easiest way is to set up a
"local remote" on your file system:

> This kind of remote is located in the local file system, but is external to
> the DVC project.

```console
$ mkdir -p /tmp/dvc-storage
$ dvc remote add local /tmp/dvc-storage
```

You should now be able to run:

```console
$ dvc push -r local
```

## Datasets

The folder structure of this project groups datasets by the external projects
that use them.
After cloning and using [`dvc pull`](https://man.dvc.org/pull) to download data
under DVC control, the workspace should look like this:


```console
$ tree
.
├── README.md
└── get-started
    ├── data.xml        # Dataset used in iterative/example-get-started
    └── data.xml.dvc

1 directory, 3 files
```
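The layout above pairs each dataset with a DVC-file of the same name plus a `.dvc` suffix. As a rough sketch, that pairing can be used to check which datasets are already present in the workspace. `check_tracked` is a hypothetical helper, not part of this project, and it assumes every DVC-file was created with `dvc add` next to the file it tracks:

```shell
# check_tracked (hypothetical helper): for each DVC-file under DIR, report
# whether the data file next to it (same name without ".dvc") is present.
# Assumption: DVC-files come from `dvc add`, so "X.dvc" sits beside "X".
check_tracked() {
    dir="${1:-.}"
    find "$dir" -type f -name '*.dvc' | sort | while IFS= read -r dvcfile; do
        data="${dvcfile%.dvc}"    # strip the .dvc suffix to get the data path
        if [ -e "$data" ]; then
            echo "present: $data"
        else
            echo "missing: $data"
        fi
    done
}
```

Running `check_tracked .` right after cloning should list `get-started/data.xml` as missing until `dvc pull` has downloaded it.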
80 changes: 80 additions & 0 deletions dataset-registry/generate.sh
@@ -0,0 +1,80 @@
#!/usr/bin/env bash

# Setup script env

# e Exit immediately if a command exits with a non-zero exit status.
# u Treat unset variables as an error when substituting.
# x Print commands and their arguments as they are executed.
set -eux

HERE="$( cd "$(dirname "$0")" ; pwd -P )"
REPO_NAME="dataset-registry"
REPO_PATH="$HERE/build/$REPO_NAME"

if [ -d "$REPO_PATH" ]; then
    echo "Repo $REPO_PATH already exists, remove it first."
    exit 1
fi

mkdir -p "$REPO_PATH"
pushd "$REPO_PATH"

# Create virtualenv, install `dvc`, initialize/config DVC project

virtualenv -p python3 .env
export VIRTUAL_ENV_DISABLE_PROMPT=true
source .env/bin/activate
echo '.env/' >> .gitignore

pip install "dvc[s3]"

git init
dvc init

# Remote active on this environment only for writing to HTTP redirect below.
dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry

# Actual remote for generated project (read-only). Redirect of S3 bucket below.
dvc remote add -d storage https://remote.dvc.org/dataset-registry

cp "$HERE/code/README.md" "$REPO_PATH"

git add .
git commit -m "Init & config DVC project, add README"

# Get Started

mkdir get-started
wget https://data.dvc.org/get-started/data.xml -O get-started/data.xml
dvc add get-started/data.xml
git add get-started/.gitignore get-started/data.xml.dvc
git commit -m "Add Get Started dataset"
dvc push

# TODO: Gather more datasets!

popd

# Quoted heredoc delimiter: the text below is printed verbatim.
cat <<'EOF'

The Git repo generated by this script is intended to be published on
https://github.com/iterative/dataset-registry. Make sure the GitHub repo
exists first.

To create it with https://hub.github.com/ for example, run:

    hub create iterative/dataset-registry -d "Get Started DVC project" \
        -h "https://dvc.org/doc/get-started"

If the GitHub repo already exists, run these commands to rewrite it:

    cd build/dataset-registry
    git remote add origin git@github.com:iterative/dataset-registry.git
    git push --force origin master
    cd ../..

You may remove the generated repo with:

    rm -fR build

EOF
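The already-exists guard near the top of `generate.sh` could also be factored into a small function. This is only an illustrative sketch, not part of the PR: `ensure_fresh_dir` is a hypothetical name, and the error message simply mirrors the one the script prints.

```shell
# ensure_fresh_dir (hypothetical helper): create DIR, but refuse to reuse an
# existing one so that a previous build is never silently overwritten.
ensure_fresh_dir() {
    if [ -d "$1" ]; then
        echo "Repo $1 already exists, remove it first." >&2
        return 1
    fi
    # -p also creates missing parent directories such as build/.
    mkdir -p "$1"
}
```

In the script this would be called as `ensure_fresh_dir "$REPO_PATH"` before `pushd "$REPO_PATH"`; with `set -e` active, the non-zero return stops the run just like the inline `exit 1`.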
1 change: 1 addition & 0 deletions example-get-started/.gitignore
@@ -1,3 +1,4 @@
# Custom
*.zip
/tmp
build/
72 changes: 40 additions & 32 deletions example-get-started/code/README.md
@@ -1,14 +1,18 @@
# DVC Get Started

This is an auto-generated repository for use in https://dvc.org/doc/get-started.
Please report any issues in
Please report any issues in its source project,
[example-repos-dev](https://github.com/iterative/example-repos-dev).

![](https://dvc.org/static/img/example-flow-2x.png)

_Get Started_ is a step-by-step introduction to basic DVC concepts. It doesn't
go into much detail, but provides links and expandable sections to learn more.

> Note that this project
[imports](https://dvc.org/doc/commands-reference/import) a dataset from
https://github.com/iterative/dataset-registry.

The idea of the project is a simplified version of the
[Tutorial](https://dvc.org/doc/tutorial). It explores the natural language
processing (NLP) problem of predicting tags for a given StackOverflow question.
@@ -34,8 +38,10 @@ $ source .env/bin/activate
$ pip install -r src/requirements.txt
```

This DVC project comes with a preconfigured remote DVC storage that has raw data
(input), intermediate, and final results that are produced.
This DVC project comes with a preconfigured DVC
[remote storage](https://dvc.org/doc/commands-reference/remote) that holds raw
data (input), intermediate, and final results that are produced. This is a
read-only HTTP remote.

```console
$ dvc remote list
@@ -48,7 +54,7 @@ You can run [`dvc pull`](https://man.dvc.org/pull) to download the data:
$ dvc pull -r storage
```

## Running in Your Environment
## Running in your environment

Run [`dvc repro`](https://man.dvc.org/repro) to reproduce the
[pipeline](https://dvc.org/doc/commands-reference/pipeline):
@@ -80,52 +86,54 @@ You should now be able to run:
$ dvc push -r local
```

## Existing Stages
## Existing stages

With the help of Git tags, this project reflects the sequence of actions that
are run in the DVC [get started](https://dvc.org/doc/get-started) guide. Feel
free to check out any of them and try DVC commands with the playground ready.

- `0-empty` - empty Git repository.
- `1-initialize` - DVC has been initialized. The `.dvc` with the cache directory
- `0-empty`: Empty Git repository initialized.
- `1-initialize`: DVC has been initialized. `.dvc/` with the cache directory
created.
- `2-remote` - remote HTTP storage initialized. It is a shared read only storage
- `2-remote`: Remote HTTP storage initialized. It's a shared read-only storage
that contains all data artifacts produced during the next steps.
- `3-add-file` - raw data file `data.xml` downloaded and put under DVC
control with [`dvc add`](https://man.dvc.org/add). First `.dvc` meta-file
created.
- `4-source` - source code downloaded and put under Git control.
- `5-preparation` - first DVC stage created using
- `3-add-file`: Raw data file `data.xml` downloaded and put under DVC control
with [`dvc add`](https://man.dvc.org/add). First DVC-file (`.dvc` file
extension) created.
- `4-source`: Source code downloaded and put under Git control.
- `5-preparation`: First stage file (DVC-file) created using
[`dvc run`](https://man.dvc.org/run). It transforms XML data into TSV.
- `6-featurization` - feature extraction step added. It also includes the split
step for simplicity. It takes data in TSV format and produces two `.pkl` files
that contain serialized feature matrices.
- `7-train` - the model training stage added. It produces `model.pkl` file - the
actual result that can be then deployed somewhere and classify questions.
- `8-evaluate` - evaluate stage, we run it on a test dataset to see the AUC
value for the model. The result is dumped into a DVC metric file so that we
can compare it with other experiments later.
- `9-bigrams` - bigrams experiment, code has been modified to extract more
- `6-featurization`: Feature extraction stage created. It takes data in TSV
format and produces two `.pkl` files that contain serialized feature matrices.
- `7-train`: Model training stage created. It produces the `model.pkl` file,
the actual result that can then be deployed to an app that implements NLP
classification.
- `8-evaluate`: Evaluation stage. Runs the model on a test dataset to produce
its performance AUC value. The result is dumped into a DVC metric file so that
we can compare it with other experiments later.
- `9-bigrams-model`: Bigrams experiment, code has been modified to extract more
features. We run [`dvc repro`](https://man.dvc.org/repro) for the first time
to illustrate how DVC can reuse cached files and detect changes along the
computational graph.
computational graph, regenerating the model with the updated data.
- `10-bigrams-experiment`: Reproduce the evaluation stage with the
bigrams-based model.

There are two additional tags:

- `baseline-experiment` - the first end-to-end result that we performance metric
- `baseline-experiment`: First end-to-end result that we have a performance
metric for.
- `bigrams-experiment` - second version of the experiment.
- `bigrams-experiment`: Second experiment (model trained using bigrams
features).

Both these tags could be used to illustrate `-a` or `-T` options across
different [DVC commands](https://man.dvc.org/).
These tags can be used to illustrate `-a` or `-T` options across different
[DVC commands](https://man.dvc.org/).
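
Because the stage tags start with a number, they can be listed in guide order with a small filter. The sketch below is an illustration only: `sort_stage_tags` is a hypothetical helper, not something this project provides, and it assumes tag names arrive one per line on standard input.

```shell
# sort_stage_tags (hypothetical helper): keep only tags that start with a
# digit (the numbered stages) and print them in numeric order.
sort_stage_tags() {
    grep '^[0-9]' | sort -n
}
```

A typical use would be `git tag --list | sort_stage_tags`, followed by `git checkout 5-preparation` and `dvc checkout` to bring the workspace to that stage. Tags without a numeric prefix, such as `baseline-experiment`, are filtered out.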

## Project Structure
## Project structure

The data files, DVC-files, and results change as stages are created one by one,
but right after you for Git clone and [`dvc pull`](https://man.dvc.org/pull) to
download files that are under DVC control, the structure of the project should
look like this:
The data files, DVC-files, and results change as stages are created one by one.
After cloning and using [`dvc pull`](https://man.dvc.org/pull) to download data
under DVC control, the workspace should look like this:

```console
$ tree