Merge pull request #35 from iterative/iesahin/issue28

Fixes for Data Access Scenario
iterative · Mar 9, 2021 · 9209c3c · 9209c3c
2 parents ef5b505 + e7fac80
commit 9209c3c
Showing 13 changed files with 158 additions and 163 deletions.
diff --git a/get-started/accessing/01-download.md b/get-started/accessing/01-download.md
@@ -1,19 +1,42 @@
 # Download
 
-Let's first get/download any file that was added to DVC:
-
-> You don't need to be inside a Git or DVC repo to execute it
+We can download any file in a DVC repository:
 
 ```
 dvc get \
   https://github.com/iterative/dataset-registry \
   get-started/data.xml
 ```{{execute}}
 
-`ls data.xml`{{execute}}
-
 `md5sum data.xml`{{execute}}
 
-Here we see that instead of accessing data file directly (e.g. with `aws s3 cp`,
-or `scp`, `wget`, etc) we are accessing it using a Git repo URL as an _entry
-point_ or as a _data/model registry_.
+`dvc get` automates the lookup: it reads the remote storage URL,
+`https://remote.dvc.org/dataset-registry`, from
+[.dvc/config](https://github.com/iterative/dataset-registry/blob/master/.dvc/config)
+and the file path, `a3/04afb96060aad90176268345e10355`, from
+[get-started/data.xml.dvc](https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc).
+
+Just for fun, let's try to download it with `wget`:
+
+```
+storage="https://remote.dvc.org/dataset-registry"
+path="a3/04afb96060aad90176268345e10355"
+wget -O data.xml.1 $storage/$path
+```{{execute}}
+
+Check whether they are the same file:
+
+`diff data.xml data.xml.1`{{execute}}
+
+Instead of downloading the data file directly (e.g., with `aws s3 cp`, `scp`,
+or `wget`), we access it using a Git repo URL as an _entry point_ or as
+a [_data/model registry_][data-registries].
+
+[data-registries]: https://dvc.org/doc/use-cases/data-registries
+
+Note that we haven't initialized DVC in the current directory yet: `dvc get`
+doesn't need an initialized project.
+
+Let's initialize DVC now.
+
+`dvc init`{{execute}}
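
Note on `01-download.md`: the lookup described in this step can be sketched in Python. This is a minimal illustration, not DVC's implementation. It assumes the remote stores each file under the first two characters of its MD5 followed by the remaining thirty, which matches the path quoted from `get-started/data.xml.dvc`. The helper names `remote_path` and `remote_url` are hypothetical.

```python
# Sketch: rebuild the URL that the `wget` example in this step downloads from.
# Assumption: the remote keeps each file under <first 2 hex chars>/<remaining 30>
# of its MD5, which matches the path quoted from get-started/data.xml.dvc.


def remote_path(md5: str) -> str:
    """Split a 32-character MD5 hex digest into the '<2>/<30>' layout."""
    if len(md5) != 32:
        raise ValueError("expected a 32-character MD5 hex digest")
    return f"{md5[:2]}/{md5[2:]}"


def remote_url(storage: str, md5: str) -> str:
    """Join the remote storage URL and the hash-derived path."""
    return f"{storage.rstrip('/')}/{remote_path(md5)}"


storage = "https://remote.dvc.org/dataset-registry"  # value read from .dvc/config
md5 = "a304afb96060aad90176268345e10355"  # assumed: the quoted path without the slash

print(remote_url(storage, md5))
```

The printed URL is the same `$storage/$path` combination that the `wget` command in the step uses.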
diff --git a/get-started/accessing/02-discovering-files.md b/get-started/accessing/02-discovering-files.md
@@ -0,0 +1,20 @@
+# Discovering files
+
+As we mentioned, if you look at the [repository][dr], you won't see
+`data/data.xml` or `model.pkl`, or any DVC-tracked files. They are not stored
+in Git. We can `dvc get` them, but how do we even know what data is tracked in a
+remote DVC repo before accessing it?
+
+[dr]: https://github.com/iterative/dataset-registry
+
+If `dvc get` is the analog of `wget` or `curl`, then `dvc list` is the analog
+of `ls` or `aws s3 ls` and similar commands:
+
+```
+dvc list \
+  https://github.com/iterative/example-get-started \
+  data/
+```{{execute}}
+
+The only difference is that we pass a Git URL, using the same interface as
+`dvc get`. Now we can see `data.xml` alongside the regular Git files.
diff --git a/get-started/accessing/02-how-does-it-work.md b/get-started/accessing/02-how-does-it-work.md
diff --git a/get-started/accessing/03-discovering-files.md b/get-started/accessing/03-discovering-files.md
diff --git a/get-started/accessing/03-python-api.md b/get-started/accessing/03-python-api.md
@@ -0,0 +1,32 @@
+# Python API
+
+Besides using DVC commands in the command line, we can also access any
+DVC-tracked artifact "natively" from Python with 
+[the API](https://dvc.org/doc/api-reference):
+
+`process.py`{{open}}
+
+The script downloads the data like `dvc get` and counts the number of lines in it: 
+
+`python3 process.py`{{execute}}
+
+The interface of [`dvc.api.open`][apiopen] is similar to the one we've
+seen already. It receives a Git repo URL and a path as arguments and works
+the same way. There are also a few important differences:
+
+[apiopen]: https://dvc.org/doc/api-reference/open
+
+- **It's Python "native"** - we don't have to combine CLI scripts with Python
+  code to process data or deploy a model.
+
+- **It doesn't consume space on the file system** - `open()` loads the data
+  into memory as needed instead of saving a copy to disk. If you want to
+  process a large dataset or deploy a huge model, you don't have to keep it
+  on the disk.
+
+- **It reads data lazily** - it doesn't allocate a huge array internally to hold
+  the data; instead, it streams it from the remote storage. This means you can
+  process a huge dataset with a very low memory footprint.
+
+- **It unifies storage access** - it doesn't matter if actual data is stored on
+  S3, or Google Cloud, or SSH - the interface is the same.
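
Note on `03-python-api.md`: this diff does not include `process.py`, so the following is only a sketch of what a line-counting script built on `dvc.api.open` might look like. The function names are hypothetical, and the repository URL and path are borrowed from the `dvc get` step rather than from the actual script.

```python
import io
from typing import Iterable


def count_lines(lines: Iterable[str]) -> int:
    """Count lines while holding only one line in memory at a time."""
    return sum(1 for _ in lines)


def count_remote_lines(path: str, repo: str) -> int:
    """Stream a DVC-tracked file and count its lines (hypothetical helper)."""
    import dvc.api  # imported here so count_lines works without DVC installed

    # dvc.api.open returns a file-like object usable as a context manager.
    with dvc.api.open(path, repo=repo) as fobj:
        return count_lines(fobj)


# Local check of the counting logic; no network access or DVC needed.
print(count_lines(io.StringIO("<row/>\n<row/>\n<row/>\n")))

# With DVC installed and network access, the remote call would look like:
# count_remote_lines(
#     "get-started/data.xml",
#     "https://github.com/iterative/dataset-registry",
# )
```

Keeping the counting logic separate from the DVC call means the lazy, line-by-line behavior can be checked locally before pointing it at remote storage.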
diff --git a/get-started/accessing/04-python-api.md b/get-started/accessing/04-python-api.md
diff --git a/get-started/accessing/04-reusing-data-or-models.md b/get-started/accessing/04-reusing-data-or-models.md
@@ -0,0 +1,36 @@
+# Reusing (importing) data or models
+
+Modern programming languages have a way to package, distribute, and reuse code
+as libraries (in the Python world - [PyPI](https://pypi.org/), `pip`, `conda`
+are some well-known examples). It's an important feature that significantly
+simplifies managing the complexity of the development process.
+
+What about datasets and ML models?
+
+A DVC repository and the `dvc import` command are enough to export data and models,
+reuse them, track upstream changes, etc. Let's give it a try:
+
+```
+dvc import \
+  https://github.com/iterative/dataset-registry \
+  get-started/data.xml -o data/data.xml
+```{{execute}}
+
+The `dvc import` command creates `data/data.xml.dvc` to track the dependency.
+You can view this file in the editor:
+
+`data/data.xml.dvc`{{open}}
+
+The `url` and `rev_lock` subfields under `repo` are used to save the origin and
+the version of the dependency, respectively.
+
+The effect of using `dvc import` is similar to running `dvc get` + `dvc add`,
+but the resulting `.dvc` file includes metadata to track changes in the source
+repository. This allows you to bring in changes from the data source later,
+using:
+
+`dvc update data/data.xml.dvc`{{execute}}
+
+In this case, everything is up to date, but if someone creates a new version of
+`data.xml` in the data registry, this command can detect the change and update the
+`data/data.xml` file accordingly.
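
Note on `04-reusing-data-or-models.md`: the role of `url` and `rev_lock` can be pictured with a small conceptual model. This is an assumption-laden sketch and not how `dvc update` is implemented. The dictionary is a trimmed, hypothetical view of the import `.dvc` file, the revision strings are placeholders, and `needs_update` is an invented helper.

```python
# Hypothetical, trimmed-down view of the `repo` subfields in an import .dvc file.
# The revision value is a placeholder, not a real commit.
dvc_file = {
    "deps": [
        {
            "path": "get-started/data.xml",
            "repo": {
                "url": "https://github.com/iterative/dataset-registry",
                "rev_lock": "<commit-at-import-time>",
            },
        }
    ]
}


def needs_update(dvc_file: dict, upstream_rev: str) -> bool:
    """Return True when the locked revision differs from the upstream one."""
    locked = dvc_file["deps"][0]["repo"]["rev_lock"]
    return locked != upstream_rev


# Same revision upstream: nothing to do, as in the tutorial step.
print(needs_update(dvc_file, "<commit-at-import-time>"))

# A newer upstream revision: the import would be refreshed.
print(needs_update(dvc_file, "<newer-commit>"))
```

In this model, `url` says where to look and `rev_lock` records which version was imported, which is the information an update needs to detect upstream changes.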
diff --git a/get-started/accessing/05-reusing-data-or-models.md b/get-started/accessing/05-reusing-data-or-models.md
diff --git a/get-started/accessing/06-congrats.md b/get-started/accessing/06-congrats.md
@@ -0,0 +1,17 @@
+# Congratulations!
+
+In this scenario, we learned how to access data registries via DVC. We can
+download model and data files with `dvc get` or import them to DVC repositories
+with `dvc import`. DVC also has an API that streams large files directly into
+memory with `dvc.api.open`.
+
+Our vision is to have a central registry for all the data and model files and
+to use them across different projects. The registry is based on Git and
+provides flexibility without requiring additional infrastructure.
+
+<p align="center">
+<img src="/dvc/courses/get-started/accessing/assets/data-registry.png">
+</p>
+
+If you want to read more about the workflow, refer to the
+[Data Registries](https://dvc.org/doc/use-cases/data-registries) use case.
diff --git a/get-started/accessing/06-data-model-and-artifact.md b/get-started/accessing/06-data-model-and-artifact.md
diff --git a/get-started/accessing/index.json b/get-started/accessing/index.json
@@ -27,7 +27,7 @@
             },
             {
                 "title": "Step 6",
-                "text": "06-data-model-and-artifact.md"
+                "text": "06-congrats.md"
             }
         ],
         "intro": {
@@ -47,9 +47,10 @@
         }
     },
     "environment": {
-        "uilayout": "terminal"
+        "uieditorpath": "/root/project",
+        "uilayout": "vscode-terminal-split"
     },
     "backend": {
         "imageid": "ubuntu:2004"
     }
-}
+}
diff --git a/get-started/accessing/init.sh b/get-started/accessing/init.sh
@@ -18,11 +18,14 @@ until hash dvc &>/dev/null; do sleep 1; done
 # enable bash completion
 source /etc/bash_completion
 
-git clone --branch 3-config-remote \
-    https://github.com/iterative/example-get-started.git
-cd example-get-started/
-git reset –-hard 3-config-remote
-cd ..
+# git clone --branch 3-config-remote \
+#     https://github.com/iterative/example-get-started.git
+# cd example-get-started/
+# git reset –-hard 3-config-remote
+# cd ..
+
+git init project
+cd project
 
 # clear screen
 clear

diff --git a/get-started/accessing/install.sh b/get-started/accessing/install.sh
@@ -1,13 +1,17 @@
 #!/bin/bash
 
-apt install --yes highlight virtualenv
+apt install --yes highlight 
 
 # install dvc
-sudo wget https://dvc.org/deb/dvc.list \
-          -O /etc/apt/sources.list.d/dvc.list
-sudo apt-get update -o Dir::Etc::sourcelist="sources.list.d/dvc.list" \
-    -o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0"
-sudo apt install dvc
+# sudo wget https://dvc.org/deb/dvc.list \
+#           -O /etc/apt/sources.list.d/dvc.list
+# sudo apt-get update -o Dir::Etc::sourcelist="sources.list.d/dvc.list" \
+#     -o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0"
+# sudo apt install dvc
+
+# installing from pip is faster 
+
+pip3 install dvc
 
 # install bash completion for dvc
 dvc completion -s bash > /etc/bash_completion.d/dvc