diff --git a/README.md b/README.md index 5eccd689..12dd66af 100644 --- a/README.md +++ b/README.md @@ -23,7 +23,11 @@ $ cd example-get-started $ ./deploy.sh ``` - +### dataset-registry + +- `generate.sh`: Generates the `dataset-registry` DVC project from scratch. This + project is used by **example-get-started** below, so it should be generated + first. ### example-get-started diff --git a/dataset-registry/code/README.md b/dataset-registry/code/README.md new file mode 100644 index 00000000..9adb363a --- /dev/null +++ b/dataset-registry/code/README.md @@ -0,0 +1,73 @@ +# DVC Dataset Registry + +This is an auto-generated repository for use in https://dvc.org/doc/. Please +report any issues in its source project, +[example-repos-dev](https://github.com/iterative/example-repos-dev). + +_Dataset Registry_ is a centralized place to manage raw data files for use in +other example DVC projects, such as +https://github.com/iterative/example-get-started. + +## Installation + +Start by cloning the project: + +```console +$ git clone https://github.com/iterative/dataset-registry +$ cd dataset-registry +``` + +This DVC project comes with a preconfigured DVC +[remote storage](https://man.dvc.org/remote) to hold all of the datasets. This +is a read-only HTTP remote. + +```console +$ dvc remote list +storage https://remote.dvc.org/dataset-registry +``` + +You can run [`dvc pull`](https://man.dvc.org/pull) to download specific datasets +locally: + +```console +$ dvc pull -r storage get-started/data.xml +``` + +## Testing data synchronization locally + +If you'd like to test commands like [`dvc push`](https://man.dvc.org/push), +that require write access to the remote storage, the easiest way would be to set +up a "local remote" on your file system: + +> This kind of remote is located in the local file system, but is external to +> the DVC project. 
+ +```console +$ mkdir -p /tmp/dvc-storage +$ dvc remote add local /tmp/dvc-storage +``` + +You should now be able to run: + +```console +$ dvc push -r local +``` + +## Datasets + +The folder structure of this project groups datasets by the +external projects they pertain to. +After cloning and using [`dvc pull`](https://man.dvc.org/pull) to download data +under DVC control, the workspace should look like this: + + +```console +$ tree +. +├── README.md +└── get-started + ├── data.xml # Dataset used in iterative/example-get-started + └── data.xml.dvc + +1 directory, 3 files +``` diff --git a/dataset-registry/generate.sh b/dataset-registry/generate.sh new file mode 100755 index 00000000..c9defc55 --- /dev/null +++ b/dataset-registry/generate.sh @@ -0,0 +1,80 @@ +#!/bin/sh + +# Setup script env + +# e Exit immediately if a command exits with a non-zero exit status. +# u Treat unset variables as an error when substituting. +# x Print commands and their arguments as they are executed. +set -eux + +HERE="$( cd "$(dirname "$0")" ; pwd -P )" +REPO_NAME="dataset-registry" +REPO_PATH="$HERE/build/$REPO_NAME" + +if [ -d "$REPO_PATH" ]; then + echo "Repo $REPO_PATH already exists, remove it first." + exit 1 +fi + +mkdir -p $REPO_PATH +pushd $REPO_PATH + +# Create virtualenv, install `dvc`, initialize/config DVC project + +virtualenv -p python3 .env +export VIRTUAL_ENV_DISABLE_PROMPT=true +source .env/bin/activate +echo '.env/' >> .gitignore + +pip install dvc[s3] + +git init +dvc init + +# Remote active on this environment only for writing to HTTP redirect below. +dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry + +# Actual remote for generated project (read-only). Redirect of S3 bucket above. +dvc remote add -d storage https://remote.dvc.org/dataset-registry + +cp $HERE/code/README.md $REPO_PATH + +git add . 
+git commit -m "Init & config DVC project, add README" + +# Get Started + +mkdir get-started +wget https://data.dvc.org/get-started/data.xml -O get-started/data.xml +dvc add get-started/data.xml +git add get-started/.gitignore get-started/data.xml.dvc +git commit -m "Add Get Started dataset" +dvc push + +# TODO: Gather more datasets! + +popd + +echo "`cat < Note that this project +[imports](https://dvc.org/doc/commands-reference/import) a dataset from +https://github.com/iterative/dataset-registry. + The idea of the project is a simplified version of the [Tutorial](https://dvc.org/doc/tutorial). It explores the natural language processing (NLP) problem of predicting tags for a given StackOverflow question. @@ -34,8 +38,10 @@ $ source .env/bin/activate $ pip install -r src/requirements.txt ``` -This DVC project comes with a preconfigured remote DVC storage that has raw data -(input), intermediate, and final results that are produced. +This DVC project comes with a preconfigured DVC +[remote storage](https://dvc.org/doc/commands-reference/remote) that holds raw +data (input), intermediate, and final results that are produced. This is a +read-only HTTP remote. ```console $ dvc remote list @@ -48,7 +54,7 @@ You can run [`dvc pull`](https://man.dvc.org/pull) to download the data: $ dvc pull -r storage ``` -## Running in Your Environment +## Running in your environment Run [`dvc repro`](https://man.dvc.org/repro) to reproduce the [pipeline](https://dvc.org/doc/commands-reference/pipeline): @@ -80,52 +86,54 @@ You should now be able to run: $ dvc push -r local ``` -## Existing Stages +## Existing stages This project with the help of the Git tags reflects the sequence of actions that are run in the DVC [get started](https://dvc.org/doc/get-started) guide. Feel free to checkout one of them and play with the DVC commands having the playground ready. -- `0-empty` - empty Git repository. -- `1-initialize` - DVC has been initialized. 
The `.dvc` with the cache directory +- `0-empty`: Empty Git repository initialized. +- `1-initialize`: DVC has been initialized. `.dvc/` with the cache directory created. -- `2-remote` - remote HTTP storage initialized. It is a shared read only storage +- `2-remote`: Remote HTTP storage initialized. It's a shared read-only storage that contains all data artifacts produced during next steps. -- `3-add-file` - raw data file `data.xml` downloaded and put under DVC - control with [`dvc add`](https://man.dvc.org/add). First `.dvc` meta-file - created. -- `4-source` - source code downloaded and put under Git control. -- `5-preparation` - first DVC stage created using +- `3-add-file`: Raw data file `data.xml` downloaded and put under DVC control + with [`dvc add`](https://man.dvc.org/add). First DVC-file (`.dvc` file + extension) created. +- `4-sources`: Source code downloaded and put under Git control. +- `5-preparation`: First stage file (DVC-file) created using [`dvc run`](https://man.dvc.org/run). It transforms XML data into TSV. -- `6-featurization` - feature extraction step added. It also includes the split - step for simplicity. It takes data in TSV format and produces two `.pkl` files - that contain serialized feature matrices. -- `7-train` - the model training stage added. It produces `model.pkl` file - the - actual result that can be then deployed somewhere and classify questions. -- `8-evaluate` - evaluate stage, we run it on a test dataset to see the AUC - value for the model. The result is dumped into a DVC metric file so that we - can compare it with other experiments later. -- `9-bigrams` - bigrams experiment, code has been modified to extract more +- `6-featurization`: Feature extraction stage created. It takes data in TSV + format and produces two `.pkl` files that contain serialized feature matrices. +- `7-train`: Model training stage created. 
It produces `model.pkl` file – the + actual result that can then get deployed to an app that implements NLP + classification. +- `8-evaluation`: Evaluation stage. Runs the model on a test dataset to produce + its performance AUC value. The result is dumped into a DVC metric file so that + we can compare it with other experiments later. +- `9-bigrams-model`: Bigrams experiment, code has been modified to extract more features. We run [`dvc repro`](https://man.dvc.org/repro) for the first time to illustrate how DVC can reuse cached files and detect changes along the - computational graph. + computational graph, regenerating the model with the updated data. +- `10-bigrams-experiment`: Reproduce the evaluation stage with the bigrams-based + model. There are two additional tags: -- `baseline-experiment` - the first end-to-end result that we performance metric +- `baseline-experiment`: First end-to-end result that we have a performance metric for. -- `bigrams-experiment` - second version of the experiment. +- `bigrams-experiment`: Second experiment (model trained using bigrams + features). -Both these tags could be used to illustrate `-a` or `-T` options across -different [DVC commands](https://man.dvc.org/). +These tags can be used to illustrate `-a` or `-T` options across different +[DVC commands](https://man.dvc.org/). -## Project Structure +## Project structure -The data files, DVC-files, and results change as stages are created one by one, -but right after you for Git clone and [`dvc pull`](https://man.dvc.org/pull) to -download files that are under DVC control, the structure of the project should -look like this: +The data files, DVC-files, and results change as stages are created one by one. 
+After cloning and using [`dvc pull`](https://man.dvc.org/pull) to download data +under DVC control, the workspace should look like this: ```console $ tree diff --git a/example-get-started/generate.sh b/example-get-started/generate.sh index 519bfd30..bb9c294d 100755 --- a/example-get-started/generate.sh +++ b/example-get-started/generate.sh @@ -1,5 +1,7 @@ #!/bin/sh +# Setup script env + # e Exit immediately if a command exits with a non-zero exit status. # u Treat unset variables as an error when substituting. # x Print commands and their arguments as they are executed. @@ -17,51 +19,66 @@ fi mkdir -p $REPO_PATH pushd $REPO_PATH -git init +# Initialize Git repo virtualenv -p python3 .env export VIRTUAL_ENV_DISABLE_PROMPT=true source .env/bin/activate echo '.env/' >> .gitignore +git init git add . git commit -m "Initialize Git repository" git tag -a "0-empty" -m "Git initialized" +# https://dvc.org/doc/get-started/install + pip install dvc[s3] +# https://dvc.org/doc/get-started/initialize + dvc init git commit -m "Initialize DVC project" git tag -a "1-initialize" -m "DVC initialized." +# https://dvc.org/doc/get-started/configure + # Remote active on this environment only for writing to HTTP redirect above. dvc remote add -d --local storage s3://dvc-public/remote/get-started # Actual remote for generated project (read-only). Redirect of S3 bucket below. dvc remote add -d storage https://remote.dvc.org/get-started +cp $HERE/code/README.md $REPO_PATH + git add . -git commit -m "Configure default HTTP remote (read-only)" +git commit -m "Configure default HTTP remote (read-only), add README" git tag -a "2-remote" -m "Read-only remote storage configured." -mkdir data -wget https://data.dvc.org/get-started/data.xml -O data/data.xml -dvc add data/data.xml +# https://dvc.org/doc/get-started/add-files + +mkdir data && cd data +dvc import https://github.com/iterative/dataset-registry \ + get-started/data.xml +cd .. 
git add data/.gitignore data/data.xml.dvc git commit -m "Add raw data to project" git tag -a "3-add-file" -m "Data file added." -dvc push +dvc push # https://dvc.org/doc/get-started/share-data + +# https://dvc.org/doc/get-started/connect-code-and-data wget https://code.dvc.org/get-started/code.zip unzip code.zip rm -f code.zip -cp $HERE/code/README.md $REPO_PATH git add . git commit -m "Add source code files to repo" git tag -a "4-sources" -m "Source code added." pip install -r src/requirements.txt +# https://dvc.org/doc/get-started/connect-code-and-data#create-a-first-data-transformation-stage + dvc run -f prepare.dvc \ -d src/prepare.py -d data/data.xml \ -o data/prepared \ @@ -71,6 +88,8 @@ git commit -m "Create data preparation stage" git tag -a "5-preparation" -m "First pipeline stage (data preparation) created." dvc push +# https://dvc.org/doc/get-started/pipeline + dvc run -f featurize.dvc \ -d src/featurization.py -d data/prepared \ -o data/features \ @@ -90,6 +109,8 @@ git commit -m "Create training stage" git tag -a "7-train" -m "Training stage created." dvc push +# https://dvc.org/doc/get-started/metrics + dvc run -f evaluate.dvc \ -d src/evaluate.py -d model.pkl -d data/features \ -M auc.metric \ @@ -100,6 +121,8 @@ git tag -a "baseline-experiment" -m "Baseline experiment evaluation" git tag -a "8-evaluation" -m "Baseline evaluation stage created." dvc push +# https://dvc.org/doc/get-started/experiments + sed -e s/max_features=5000\)/max_features=6000\,\ ngram_range=\(1\,\ 2\)\)/ -i "" \ src/featurization.py @@ -107,6 +130,8 @@ dvc repro train.dvc git commit -am "Reproduce model using bigrams" git tag -a "9-bigrams-model" -m "Model retrained using bigrams." 
+# https://dvc.org/doc/get-started/compare-experiments + dvc repro evaluate.dvc git commit -am "Evaluate bigrams model" git tag -a "bigrams-experiment" -m "Bigrams experiment evaluation" @@ -132,9 +157,10 @@ cd build/example-get-started git remote add origin git@github.com:iterative/example-get-started.git git push --force origin master git push --force origin --tags +cd ../.. You may remove the generated repo with: -rm -fR build/ +rm -fR build `"
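Both `generate.sh` scripts begin with the same guard: refuse to run if the build directory already exists, otherwise create it. That step could be sketched in isolation roughly as follows. This is a minimal sketch, not code from the diff: the function name `prepare_repo_dir` is hypothetical, and it returns a status instead of calling `exit` so that it can be reused and tested.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the diff): mirrors the
# "already exists" guard that both generate.sh scripts perform inline.
prepare_repo_dir() {
    repo_path="$1"
    if [ -d "$repo_path" ]; then
        # Same message as the scripts print, sent to stderr here.
        echo "Repo $repo_path already exists, remove it first." >&2
        return 1
    fi
    # -p creates any missing parent directories (e.g. build/).
    mkdir -p "$repo_path"
}
```

A script could then call `prepare_repo_dir "$REPO_PATH" || exit 1` before changing into the directory. Quoting `"$REPO_PATH"` keeps the guard working when the checkout path contains spaces.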